I've tried all 3 of those and none of them has worked out for me. Our intranet has 802 PDFs from lots of (vendor) sources, and every pure Java parser has trouble with some of them. I've since switched to pdftotext from xpdf, at the link below. True, it's not pure Java, but it works on all platforms with my doc set, and I suggest people use it, especially if they have any trouble with the Java parsers mentioned below.
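For anyone who wants to drive pdftotext from Java (say, from an indexing job), here is a minimal sketch. It assumes the `pdftotext` binary from xpdf is on the PATH and that Java 9 or later is available; the class name `PdfToTextRunner`, its method names, and the timeout parameter are made up for illustration. The timeout is there because a document that makes a parser loop should not hang the whole run.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs the external pdftotext tool and returns the extracted text.
class PdfToTextRunner {

    // Builds: pdftotext -q -enc UTF-8 <pdf> <txt>
    // -q suppresses messages, -enc selects the output text encoding.
    static List<String> buildCommand(Path pdf, Path txt) {
        return List.of("pdftotext", "-q", "-enc", "UTF-8",
                       pdf.toString(), txt.toString());
    }

    // Extracts text into a temp file, waits up to timeoutSeconds, then reads the file.
    static String extractText(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path txt = Files.createTempFile("pdftotext", ".txt");
        try {
            Process p = new ProcessBuilder(buildCommand(pdf, txt))
                    .redirectErrorStream(true)
                    // Discard console output so a full pipe can never block the child.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // give up on documents that take too long
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException(
                        "pdftotext exited with " + p.exitValue() + " for " + pdf);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }
}
```

A call such as `PdfToTextRunner.extractText(Paths.get("manual.pdf"), 60)` would return the text to feed into a Lucene field; the file name and the 60-second limit are placeholders.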
http://www.foolabs.com/xpdf/

Problems: some Java parsers have trouble with the "dummy" encryption used, some parsers go into loops on some docs, and some parsers crash on some docs. Yes, I've reported some of these problems to the authors.

-----Original Message-----
From: Borkenhagen, Michael (ofd-ko zdfin) [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 22, 2002 6:42 AM
To: 'Lucene Users List'
Subject: AW: PDF parser

There are different parsers available, and each has its own advantages and disadvantages. I use a combination of PDFBox http://www.pdfbox.org/ and Etymon PJ http://www.etymon.com/pjc/, because their APIs are very simple. Both parse PDF into a format of their own and provide interfaces to get at the PDF document's contents. Other developers on this list prefer JPedal http://www.jpedal.org/, which parses PDF into XML and provides an XML tree with the PDF document's contents. JPedal does the job best, but the documentation isn't very detailed.

Micha

-----Original Message-----
From: Thomas Chacko [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 22, 2002 15:26
To: Lucene Users List
Subject: PDF parser

What's the best parser available to extract text from PDF documents?
Expecting a reply ASAP.

Thanks in advance,
Thomas Chacko

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
