I don't have a fix, but I have a suggestion - have you tried using the very latest version of PDFBox? I believe it's going through Apache Incubator... aha, here: http://incubator.apache.org/pdfbox/
Too bad the page doesn't say *when* the release was made, so one can get a sense of the state of the project.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: Sebastian Nagel <[email protected]>
> To: [email protected]
> Sent: Tuesday, August 4, 2009 12:48:22 PM
> Subject: PDFBox log file locks Fetcher
>
> Hi,
>
> I have stumbled on a hanging Fetcher.
> The result is similar to that described in NUTCH-719 and NUTCH-721:
> a long wait, then "Aborting with xxx hung threads."
>
> I tracked it down and found that the hung FetcherThreads are processing
> exclusively PDF documents.
> Now the penny dropped and I connected this with some strange files I observed
> in $PWD after failed crawls:
> PDFBox.log PDFBox.log.1 PDFBox.log.1.lck
>
> I found that it is a known issue for PDFBox that a file PDFBox.log is written
> in the current directory:
> http://www.mail-archive.com/[email protected]/msg00344.html
> (the code is still unchanged!)
>
> Indeed it seems that two or more FetcherThreads block each other when they
> happen to be called simultaneously to parse a PDF document.
>
> When removing parse-pdf from "plugin.includes" the problem disappears.
>
> Does anyone have a quick fix for this problem? Of course, I also want to get
> content from PDF files.
>
> Thanks,
>
> Sebastian
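
Regarding the FetcherThreads blocking each other while parsing PDFs: one stopgap that might be worth trying, purely as a sketch, is to let only one thread at a time enter the PDF parser. `SerializedParse` and `run` below are hypothetical names, not part of Nutch or PDFBox, and I have not verified that this resolves the log-file lock. It would only help if every PDF parse goes through the wrapper, and it trades PDF parsing throughput for the absence of contention.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical helper (not part of Nutch or PDFBox): runs a parse call
 * while holding a single JVM-wide lock, so at most one thread executes
 * the wrapped call at any time.
 */
final class SerializedParse {
    // One lock shared by all threads in this JVM.
    private static final ReentrantLock PDF_LOCK = new ReentrantLock();

    private SerializedParse() {}

    /** Runs parseCall under the shared lock and returns its result. */
    static <T> T run(Callable<T> parseCall) throws Exception {
        PDF_LOCK.lock();
        try {
            return parseCall.call();
        } finally {
            // Always release, even if the parse throws.
            PDF_LOCK.unlock();
        }
    }
}
```

A parse plugin would wrap its existing PDF call, for example `SerializedParse.run(() -> parsePdf(content))`, where `parsePdf` is a placeholder for whatever the plugin already does.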
