Ben, many thanks for your comprehensive answer. Unfortunately I cannot send the problem PDFs because they are the property of the company and are top secret :)
Regards,
J.

Ben Litchfield <[EMAIL PROTECTED]>
22.10.2004 14:40
Please respond to "Lucene Users List"

To: Lucene Users List <[EMAIL PROTECTED]>
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject: Re: Need advice: what pdf lib to use?

Please post any PDFBox issues you notice on the PDFBox SourceForge bug list and, if possible, attach or email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox; you can monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have used xpdf (open source, but not Java) to extract the text.

For other Java solutions:

PDFTextStream (commercial) - "Fastest PDF-to-Text Solution for Java"
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org

On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote:

> Hello all,
>
> I need a piece of advice/experience.
>
> What PDF parser (written in Java) would you recommend?
>
> I have been playing with PDFBox-0.6.7a and would not say I was very
> satisfied with it.
>
> On certain PDFs (not well formatted, but still readable with Acrobat) it
> ran into a dead loop (this I could fix in the code),
> and on one file it produced an "out of memory" error and killed the JVM :(
> (this problem I could not identify yet).
>
> The performance was not too great either: it took about 19 hours to
> index 13000 files (about 3.5 GB).
>
> Regards,
> J.
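The dead loop and the out-of-memory failure described in the original message can stall or kill a whole indexing run. One way to contain them, whichever parser is used, is to run each file's extraction in a worker thread with a time limit. The following is only a minimal sketch under stated assumptions: GuardedExtraction, Result and extractWithTimeout are hypothetical names that do not come from PDFBox or Lucene, and the Callable stands in for whatever text-extraction call the chosen library provides.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox or Lucene.
public class GuardedExtraction {

    /** Outcome of one extraction attempt; error is null on success. */
    public static final class Result {
        public final String text;
        public final String error;

        Result(String text, String error) {
            this.text = text;
            this.error = error;
        }

        public boolean ok() {
            return error == null;
        }
    }

    /**
     * Runs the given extractor (assumed to wrap the parser call for one file)
     * in its own thread and gives up after the timeout.
     */
    public static Result extractWithTimeout(Callable<String> extractor,
                                            long timeout, TimeUnit unit) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive at exit
            return t;
        });
        Future<String> future = pool.submit(extractor);
        try {
            return new Result(future.get(timeout, unit), null);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a loop that never
            // checks for interruption will keep burning CPU in the background.
            future.cancel(true);
            return new Result(null, "timed out");
        } catch (ExecutionException e) {
            // Anything thrown inside the worker arrives here as the cause,
            // including an OutOfMemoryError raised in that thread.
            return new Result(null, "failed: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new Result(null, "interrupted");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The indexing loop would then log the file name and skip the document whenever ok() returns false, instead of hanging on it. A thread-level guard like this cannot reclaim memory or forcibly stop a parser that ignores interruption; running the extraction in a separate process would be the stronger form of isolation.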