I've tried all 3 of those and none have worked out for me.
Our intranet has 802 PDFs from lots of (vendor) sources and
all the pure Java parsers have trouble w/ some of them.
I've since gone to pdftotext from xpdf at the link below.
True, it's not pure Java, but it works on all platforms
w/ my doc set and I suggest people use it, esp. if they
have any trouble w/ the Java parsers below.

http://www.foolabs.com/xpdf/
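
For anyone who wants to call pdftotext from Java, here is a minimal sketch, not a definitive implementation. It assumes a pdftotext executable is on the PATH and a Java 9+ runtime. The class name, method names, and the timeout are made up for illustration. The -q (quiet) and -enc UTF-8 options come from the xpdf command line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of xpdf or Lucene.
class PdfToTextRunner {

    // Command line: pdftotext -q -enc UTF-8 <pdf> <text-file>
    static List<String> buildCommand(String executable, Path pdf, Path textOut) {
        return List.of(executable, "-q", "-enc", "UTF-8",
                pdf.toString(), textOut.toString());
    }

    // Runs pdftotext and returns the extracted text.
    // Assumes "pdftotext" can be found on the PATH.
    static String extractText(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path textOut = Files.createTempFile("pdftotext", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand("pdftotext", pdf, textOut))
                    .redirectErrorStream(true)
                    // Discard console output so a full pipe can never block the child.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (process.exitValue() != 0) {
                throw new IOException("pdftotext exited with code "
                        + process.exitValue() + " on " + pdf);
            }
            return new String(Files.readAllBytes(textOut), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textOut);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: pass the path of a PDF as the first argument.
        System.out.println(extractText(Paths.get(args[0]), 60));
    }
}
```

The returned string can then go into a Lucene field the same way as text from any other parser. Writing to a temp file instead of reading a pipe keeps the timeout effective even if the process hangs.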

Problems: some Java parsers have trouble w/ the "dummy" encryption
used, some go into loops w/ some docs, and some crash on some docs.
Yes, I've reported some of these problems to the authors.
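
If you stay with an in-process Java parser, one way to keep a looping or crashing document from stalling a whole indexing run is to put a time limit around each parse call. The sketch below is an assumption about how that could look, not something any of the parsers provide. GuardedParse and parseWithTimeout are hypothetical names, and the Callable stands in for whatever parser call you use.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical guard around an arbitrary parser call.
class GuardedParse {

    // Returns the parsed text, or empty if the parser timed out or threw.
    static Optional<String> parseWithTimeout(Callable<String> parser, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "pdf-parse");
            thread.setDaemon(true); // a stuck parser must not keep the JVM alive
            return thread;
        });
        Future<String> result = pool.submit(parser);
        try {
            return Optional.ofNullable(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            result.cancel(true); // only interrupts; a tight loop may ignore it
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the parser threw on this document
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note the limitation: cancelling only interrupts the worker thread, so a parser stuck in a tight loop may keep burning CPU in the background. Running the extraction in a separate process avoids that, since a process can be killed outright.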

-----Original Message-----
From: Borkenhagen, Michael (ofd-ko zdfin)
[mailto:[EMAIL PROTECTED]]
Sent: Friday, November 22, 2002 6:42 AM
To: 'Lucene Users List'
Subject: RE: PDF parser


There are different parsers available, and each has its own advantages
and disadvantages.
I use a combination of PDFBox http://www.pdfbox.org/ and Etymon PJ
http://www.etymon.com/pjc/, because their APIs are very simple. Both
parse PDF into a format of their own and provide interfaces to get the
PDF document's contents.

Other developers on this list prefer JPedal http://www.jpedal.org/, which
parses PDF into XML and provides an XML tree with the PDF document's
contents.
JPedal does the job best, but its documentation isn't very detailed.

Micha

-----Original Message-----
From: Thomas Chacko [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 22, 2002 15:26
To: Lucene Users List
Subject: PDF parser


What's the best parser available to extract text from PDF documents?
Expecting a reply ASAP.

Thanks in advance
Thomas Chacko


--
To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>


