I don't have a fix, but I have a suggestion - have you tried using the very 
latest version of PDFBox?  I believe it's going through Apache Incubator... 
aha, here: http://incubator.apache.org/pdfbox/

Too bad the page doesn't say *when* the release was made, so one can get a 
sense of the state of the project.
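
If upgrading doesn't help, one untested idea is to make sure only one
thread at a time enters the PDF parsing code, since your description
suggests the threads only block when they parse PDFs simultaneously.
Below is a minimal sketch of that idea. The class name SerializedPdfParse
is made up for illustration and is not part of Nutch or PDFBox, and I
haven't checked where in parse-pdf the call would have to be wrapped, so
treat it as a starting point rather than a fix:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical helper (not part of Nutch or PDFBox): runs a piece of
 * work while holding a single JVM-wide lock, so that at most one thread
 * at a time executes the wrapped PDF parsing call.
 */
final class SerializedPdfParse {
    // One shared lock for every caller in this JVM.
    private static final ReentrantLock PDF_LOCK = new ReentrantLock();

    private SerializedPdfParse() {}

    /** Runs parseCall under the lock and returns its result. */
    static <T> T run(Callable<T> parseCall) throws Exception {
        PDF_LOCK.lock();
        try {
            return parseCall.call();
        } finally {
            // Always release, even if parsing throws.
            PDF_LOCK.unlock();
        }
    }
}
```

The obvious cost is that PDFs would be parsed one after another, which
slows down the fetch, but all other content types would be unaffected.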

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Sebastian Nagel <[email protected]>
> To: [email protected]
> Sent: Tuesday, August 4, 2009 12:48:22 PM
> Subject: PDFBox log file locks Fetcher
> 
> Hi,
> 
> I have run into a hanging Fetcher.
> The result is similar to that described in NUTCH-719 and NUTCH-721:
> a long wait, then "Aborting with xxx hung threads."
> 
> I tracked it down and found that the hung FetcherThreads are processing
> exclusively PDF documents.
> Now the penny dropped and I connected this with some strange files I
> observed in $PWD after failed crawls:
>   PDFBox.log  PDFBox.log.1  PDFBox.log.1.lck
> 
> I found that it is a known issue for PDFBox that a file PDFBox.log is
> written in the current directory:
> http://www.mail-archive.com/[email protected]/msg00344.html
> (the code is still unchanged!)
> 
> Indeed, it seems that two or more FetcherThreads block each other when
> they happen to be called simultaneously to parse PDF documents.
> 
> When I remove parse-pdf from "plugin.includes", the problem disappears.
> 
> Does anyone have a quick fix for this problem? Of course, I still want
> to get content from PDF files.
> 
> Thanks,
> 
> Sebastian
