Re: PDFBox log file locks Fetcher

Sebastian Nagel Tue, 04 Aug 2009 12:04:02 -0700

Dear Otis,

I checked out the latest version and successfully built "trunk"; the jar file
is named pdfbox*0.8.jar.
I still have to reconfigure the build.xml of Nutch's parse-pdf plugin (nothing
more, I hope).


But the code writing to $PWD/PDFBox.log is still in PDFBox. I guess it must be
removed or changed; otherwise you will run into serious problems when using
PDFBox in a concurrent environment such as Nutch. So someone should get in
touch with the PDFBox developers. Shall I?

Another suggestion would be (additionally) to run the parser in its own thread,
controlled by the FetcherThread, which stops the parser thread when a timeout is
reached. Parser plugins always depend on a lot of external libraries such as
PDFBox, so you may well run into a similar situation again.
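
A minimal sketch of what I mean by a parser thread with a timeout, using only
java.util.concurrent. The class TimedParse and the method runWithTimeout are
hypothetical names made up for illustration; they are not part of Nutch or
PDFBox, and the real FetcherThread would need more careful error handling.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not an existing Nutch or PDFBox class.
final class TimedParse {

    // Runs the task in its own thread and waits at most timeoutMillis for the result.
    // Returns null if the task did not finish in time.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true); // a stuck parser thread must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the parser thread. This only stops the work if the
            // library code reacts to interruption; a thread blocked on a
            // monitor or file lock may stay stuck in the background.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task itself threw; pass the original cause on to the caller.
            throw new Exception("parser failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The calling thread would wrap the actual parse call in a Callable, pass it to
runWithTimeout together with a configured timeout, and treat a null result as a
failed parse. Note that this only frees the caller: it does not remove the
underlying contention on PDFBox.log.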

Sebastian

Otis Gospodnetic wrote:
> I don't have a fix, but I have a suggestion - have you tried using the very 
> latest version of PDFBox?  I believe it's going through Apache Incubator... 
> aha, here: http://incubator.apache.org/pdfbox/
> 
> Too bad the page doesn't say *when* the release was made, so one can get a 
> sense of the state of the project.
> 
>  Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> ----- Original Message ----
>> From: Sebastian Nagel <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, August 4, 2009 12:48:22 PM
>> Subject: PDFBox log file locks Fetcher
>>
>> Hi,
>>
>> I have stumbled upon a hanging Fetcher.
>> The result is similar to that described in NUTCH-719 and NUTCH-721:
>> Long waiting then "Aborting with xxx hung threads."
>>
>> I tracked it down and found that the hung FetcherThreads are processing
>> exclusively PDF documents.
>> Then the penny dropped and I connected this with some strange files I
>> observed in $PWD after failed crawls:
>>   PDFBox.log  PDFBox.log.1  PDFBox.log.1.lck
>>
>> I found that it is a known issue that PDFBox writes a file PDFBox.log to
>> the current directory:
>> http://www.mail-archive.com/[email protected]/msg00344.html
>> (the code is still unchanged!)
>>
>> Indeed, it seems that two or more FetcherThreads block each other when
>> they happen to parse PDF documents simultaneously.
>>
>> When I remove parse-pdf from "plugin.includes", the problem disappears.
>>
>> Does anyone have a quick fix for this problem? Of course, I also want to
>> get content from PDF files.
>>
>> Thanks,
>>
>> Sebastian
>
