I should have included the link, but I used PDFBox.

Thanks,
Steve Betts
[EMAIL PROTECTED]
937-477-1797

-----Original Message-----
From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25, 2006 10:34 AM
To: [email protected]
Subject: Re: Parsing PDF Nutch Achilles heel?

From where do I get the new version
http://www.pdfbox.org/
or
http://svn.apache.org/viewcvs.cgi/lucene/nutch/

Steve Betts wrote:

>There is a bug in the PDF parser tool used with 0.7. You can get a newer
>version to replace the jars with the parse-pdf plugin and the freeze will go
>away.
>
>Thanks,
>
>Steve Betts
>[EMAIL PROTECTED]
>937-477-1797
>
>-----Original Message-----
>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, January 25, 2006 10:10 AM
>To: [email protected]
>Subject: Parsing PDF Nutch Achilles heel?
>
>I have been doing some testing on different nutch configurations to see
>what slows down the fetching process on my servers(nutch 0.7.1).
>My general experience is that the PDF parse process is nutchs Achilles heel.
>
>Nutch works fine on older computers, but with the combination of
>|parse-(text|html|pdf)
>and http.content.limit = -1(needed to get PDF parsing to work) nutch
>sometimes freezes completely.
>
>Is there planned any improvement to the parsing of PDF files in the next
>version of nutch (0.8)?
