Hi,

I have run into a hanging Fetcher.
The result is similar to that described in NUTCH-719 and NUTCH-721:
a long wait, then "Aborting with xxx hung threads."

I tracked it down and found that the hung FetcherThreads are exclusively
processing PDF documents.
Then the penny dropped, and I connected this with some strange files I had
observed in $PWD after failed crawls:
PDFBox.log  PDFBox.log.1  PDFBox.log.1.lck

It turns out to be a known PDFBox issue that a file named PDFBox.log is
written to the current directory:
http://www.mail-archive.com/[email protected]/msg00344.html
(the code is still unchanged!)
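
For what it's worth, the file names look like what
java.util.logging.FileHandler leaves behind when a second handler opens a log
file that the first one still holds. I have not confirmed that PDFBox logs
through that class, so the sketch below only illustrates the naming. The class
name LogLockDemo is hypothetical, and a temporary directory stands in for $PWD:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.logging.FileHandler;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogLockDemo {

    // Opens two handlers on the same log file name and returns the sorted
    // names of the files present in the directory while both are open.
    // Assumption: this mimics two components logging to "PDFBox.log" at once.
    public static List<String> openTwice(Path dir) throws IOException {
        String pattern = dir.resolve("PDFBox.log").toString();
        FileHandler first = new FileHandler(pattern);
        // The second handler finds the first lock taken and falls back to a
        // name with a unique number appended.
        FileHandler second = new FileHandler(pattern);
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .sorted()
                        .collect(Collectors.toList());
        } finally {
            // Closing a handler removes its .lck file; a thread that hangs
            // or is aborted never gets this far.
            second.close();
            first.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lockdemo");
        System.out.println(openTwice(dir));
    }
}
```

If this is indeed the mechanism, a leftover PDFBox.log.1.lck would mean that a
second handler was opened and never closed.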

Indeed, it seems that two or more FetcherThreads block each other when they
happen to parse a PDF document at the same time.

When I remove parse-pdf from "plugin.includes", the problem disappears.

Does anyone have a quick fix for this problem? Of course, I still want to get
content from PDF files.

Thanks,

Sebastian
