Kristian Hermsdorf wrote:

We're using pdftotext as well, because PDFBox is really slow. If your application has to work under Windows, you will probably experience some mysterious Java VM crashes while executing external processes in batch mode. (This is because of a bug in the Windows VM; we implemented our own Process with JNI to work around it.)

Just to defend PDFBox: we actually recently decided to move in the opposite direction.


We just removed pdftotext from our application and are now using PDFBox 0.7.0 for all our PDF processing. Before we were using them both in parallel: pdftotext for fast text extraction and PDFBox for all metadata such as titles, authors, etc.

One reason for this is that with version 0.7.0 the difference in performance was only marginal on our test set of 113 PDF documents from various sources. Of course, the difference will be bigger when you are only extracting text, because in the old situation we had to let two tools process the same file.

Upon closer inspection of the output, we also saw that pdftotext was unable to extract text from a significant number of PDFs (9 out of 113 documents, all perfectly readable), while PDFBox performed flawlessly. For us, quality is of greater concern than speed.

Finally, I must say that the speed and quality of Ben's replies to bug reports and suggestions are very impressive, giving us confidence that future problems will be handled satisfactorily.


Regards,

Chris