I should have included the link, but I used PDFBox.

Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797


-----Original Message-----
From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25, 2006 10:34 AM
To: [email protected]
Subject: Re: Parsing PDF Nutch Achilles heel?

Where do I get the new version: from http://www.pdfbox.org/ or from
http://svn.apache.org/viewcvs.cgi/lucene/nutch/?



Steve Betts wrote:

>There is a bug in the PDF parser tool used with 0.7. You can get a newer
>version to replace the jars with the parse-pdf plugin and the freeze will
>go away.
>
>Thanks,
>
>Steve Betts
>[EMAIL PROTECTED]
>937-477-1797
>
>-----Original Message-----
>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, January 25, 2006 10:10 AM
>To: [email protected]
>Subject: Parsing PDF Nutch Achilles heel?
>
>I have been testing different Nutch configurations to see what slows
>down the fetching process on my servers (Nutch 0.7.1).
>In my experience, PDF parsing is Nutch's Achilles heel.
>
>Nutch works fine on older computers, but with the combination of
>parse-(text|html|pdf)
>and http.content.limit = -1 (needed to get PDF parsing to work), Nutch
>sometimes freezes completely.
>
>Are any improvements to PDF parsing planned for the next
>version of Nutch (0.8)?
>