Dear Otis,

I checked out the latest version and successfully built "trunk"; the jar file is named pdfbox*0.8.jar. I still have to reconfigure the build.xml of Nutch's parse-pdf plugin (and, I hope, nothing more).
But the code writing to $PWD/PDFBox.log is still in PDFBox. I think it must be removed or changed; otherwise you will run into serious problems when using PDFBox in a concurrent environment such as Nutch. So someone should get in touch with the PDFBox developers. Shall I?

Another suggestion would be (additionally) to run the parser in its own thread, controlled by the FetcherThread, which stops the parser thread when a timeout is reached. Parser plugins always depend on a lot of external libraries such as PDFBox, so a similar situation may well come up again.

Sebastian

Otis Gospodnetic wrote:
> I don't have a fix, but I have a suggestion - have you tried using the very
> latest version of PDFBox? I believe it's going through Apache Incubator...
> aha, here: http://incubator.apache.org/pdfbox/
>
> Too bad the page doesn't say *when* the release was made, so one can get a
> sense of the state of the project.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Sebastian Nagel <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, August 4, 2009 12:48:22 PM
>> Subject: PDFBox log file locks Fetcher
>>
>> Hi,
>>
>> I stumble with a hanging Fetcher.
>> The result is similar to that described in NUTCH-719 and NUTCH-721:
>> Long waiting then "Aborting with xxx hung threads."
>>
>> I tracked it down and found that the hung FetcherThreads are processing
>> exclusively PDF documents.
>> Now the penny dropped and I connected this with some strange files I
>> observed in $PWD after failed crawls:
>> PDFBox.log PDFBox.log.1 PDFBox.log.1.lck
>>
>> I found that it is a known issue for PDFBox that a file PDFBox.log is
>> written in the current directory:
>> http://www.mail-archive.com/[email protected]/msg00344.html
>> (the code is still unchanged!)
>>
>> Indeed it seems that two or more FetcherThreads are blocking each other
>> when accidentally called simultaneously to parse a PDF document.
>>
>> When removing parse-pdf from the "plugin.includes" the problem disappears.
>>
>> Has anyone a quick fix for this problem? Of course, I want to get content
>> also from PDF files.
>>
>> Thanks,
>>
>> Sebastian
>
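
P.S. To make the timeout suggestion above a bit more concrete, here is a minimal sketch of running a parse call on a separate thread and giving up after a deadline. It uses only java.util.concurrent. The names TimedParse and runWithTimeout are hypothetical and are not part of Nutch or PDFBox, and the Callable merely stands in for whatever the parse plugin actually does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not Nutch or PDFBox code.
class TimedParse {

    /**
     * Runs the task on a worker thread and waits at most timeoutMs for it.
     * Returns the task's result, or the fallback value if the deadline passes.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMs, T fallback)
            throws Exception {
        // Daemon thread: a parser that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests interruption of the worker; the caller moves on either way.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes well within the deadline.
        System.out.println(runWithTimeout(() -> "parsed text", 1000, "TIMEOUT"));

        // A task that simulates a hung parser by sleeping past the deadline.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "too late";
        }, 200, "TIMEOUT"));
    }
}
```

One caveat: cancel(true) only interrupts the worker thread. A parser that is blocked on something that ignores interrupts (possibly including the log file lock) would keep its thread, but the calling thread would at least be able to continue with the next document.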
