Hi, I stumbled upon a hanging Fetcher. The result is similar to that described in NUTCH-719 and NUTCH-721: a long wait, then "Aborting with xxx hung threads."
I tracked it down and found that the hung FetcherThreads are exclusively processing PDF documents. Then the penny dropped, and I connected this with some strange files I had observed in $PWD after failed crawls:

  PDFBox.log
  PDFBox.log.1
  PDFBox.log.1.lck

I found that it is a known issue with PDFBox that a file PDFBox.log is written to the current directory:
http://www.mail-archive.com/[email protected]/msg00344.html
(the code is still unchanged!)

Indeed, it seems that two or more FetcherThreads block each other when they happen to be called simultaneously to parse a PDF document. When I remove parse-pdf from "plugin.includes", the problem disappears.

Does anyone have a quick fix for this problem? Of course, I also want to get content from PDF files.

Thanks,
Sebastian
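
If the hang really comes from several threads entering the PDF parser at the same time, one conceivable stopgap is to serialize those calls behind a single lock. The following is only a minimal sketch of that idea: the class SerializedPdfParse and the method parseSerialized are hypothetical names, not part of Nutch or PDFBox, and the Supplier stands in for whatever call actually parses the PDF. It has not been verified against the real parse-pdf plugin.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper (not a Nutch or PDFBox class): lets only one
// thread at a time run the wrapped PDF parsing call.
final class SerializedPdfParse {

    // One lock shared by all fetcher threads in this JVM.
    private static final ReentrantLock PDF_LOCK = new ReentrantLock();

    private SerializedPdfParse() {
    }

    // Runs parseCall while holding the lock and returns its result.
    // The lock is released even if parseCall throws.
    static <T> T parseSerialized(Supplier<T> parseCall) {
        PDF_LOCK.lock();
        try {
            return parseCall.get();
        } finally {
            PDF_LOCK.unlock();
        }
    }
}
```

A caller would wrap the existing parse invocation, for example SerializedPdfParse.parseSerialized(() -> doParse(content)), where doParse is again a placeholder. This trades PDF parsing throughput for avoiding the contention, and it does not stop PDFBox.log from being written to the current directory.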
