RE: Parsing PDF Nutch Achilles heel?

Steve Betts Wed, 25 Jan 2006 08:09:34 -0800

I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster,
but it does allow it to complete.


Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797


-----Original Message-----
From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25, 2006 10:49 AM
To: [email protected]
Subject: Re: Parsing PDF Nutch Achilles heel?

PDFBox-0.7.2, or one of the nightly builds (PDFBox-0.7.3-dev...)?

Steve Betts wrote:

>I should have included the link, but I used PDFBox.
>
>Thanks,
>
>Steve Betts
>[EMAIL PROTECTED]
>937-477-1797
>
>
>-----Original Message-----
>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, January 25, 2006 10:34 AM
>To: [email protected]
>Subject: Re: Parsing PDF Nutch Achilles heel?
>
>Where do I get the new version: http://www.pdfbox.org/ or
>http://svn.apache.org/viewcvs.cgi/lucene/nutch/ ?
>
>Steve Betts wrote:
>
>>There is a bug in the PDF parser tool used with 0.7. You can get a newer
>>version and replace the jars in the parse-pdf plugin, and the freeze will
>>go away.
>>
>>Thanks,
>>
>>Steve Betts
>>[EMAIL PROTECTED]
>>937-477-1797
>>
>>-----Original Message-----
>>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]
>>Sent: Wednesday, January 25, 2006 10:10 AM
>>To: [email protected]
>>Subject: Parsing PDF Nutch Achilles heel?
>>
>>I have been testing different Nutch configurations to see
>>what slows down the fetching process on my servers (Nutch 0.7.1).
>>My general experience is that the PDF parse process is Nutch's Achilles
>>heel.
>>Nutch works fine on older computers, but with the combination of
>>|parse-(text|html|pdf)
>>and http.content.limit = -1 (needed to get PDF parsing to work), Nutch
>>sometimes freezes completely.
>>
>>Is any improvement to the parsing of PDF files planned for the next
>>version of Nutch (0.8)?
