I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster, but it does allow it to complete.

Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797

-----Original Message-----
From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25, 2006 10:49 AM
To: [email protected]
Subject: Re: Parsing PDF Nutch Achilles heel?

PDFBox-0.7.2 or one of the nightly builds PDFBox-0.7.3-dev...

Steve Betts wrote:

>I should have included the link, but I used PDFBox.
>
>Thanks,
>
>Steve Betts
>[EMAIL PROTECTED]
>937-477-1797
>
>-----Original Message-----
>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, January 25, 2006 10:34 AM
>To: [email protected]
>Subject: Re: Parsing PDF Nutch Achilles heel?
>
>Where do I get the new version: http://www.pdfbox.org/ or
>http://svn.apache.org/viewcvs.cgi/lucene/nutch/ ?
>
>Steve Betts wrote:
>
>>There is a bug in the PDF parser tool used with 0.7. You can get a newer
>>version to replace the jars in the parse-pdf plugin and the freeze will
>>go away.
>>
>>Thanks,
>>
>>Steve Betts
>>[EMAIL PROTECTED]
>>937-477-1797
>>
>>-----Original Message-----
>>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
>>Sent: Wednesday, January 25, 2006 10:10 AM
>>To: [email protected]
>>Subject: Parsing PDF Nutch Achilles heel?
>>
>>I have been doing some testing on different Nutch configurations to see
>>what slows down the fetching process on my servers (Nutch 0.7.1).
>>My general experience is that the PDF parse process is Nutch's Achilles
>>heel.
>>Nutch works fine on older computers, but with the combination of
>>|parse-(text|html|pdf)
>>and http.content.limit = -1 (needed to get PDF parsing to work) Nutch
>>sometimes freezes completely.
>>
>>Is any improvement to the parsing of PDF files planned for the next
>>version of Nutch (0.8)?
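
For reference, the two settings named in the original message could look roughly like the fragment below when placed as overrides in conf/nutch-site.xml. This is only a sketch: the thread gives just the `parse-(text|html|pdf)` part of the plugin list and the `http.content.limit = -1` value, so the other entries in `plugin.includes` are an assumption and should be checked against the nutch-default.xml shipped with your release.

```xml
<!-- Sketch only: goes inside the root element of conf/nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Only "parse-(text|html|pdf)" comes from the thread; the surrounding
       plugin names are assumed and may differ in your installation. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- -1 disables truncation of fetched content, which the thread says is
       needed for PDF parsing to work. -->
  <value>-1</value>
</property>
```

With truncation disabled, very large PDFs are downloaded and passed to the parser in full, which is consistent with the slowdowns and freezes described above.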
