Just to illustrate how serious this was for my crawl:
The crawl was started with 4000 seed URLs, all from different hosts,
and the following options and properties:
threads = 10
depth = 7
generate.max.per.host = 7
topN = 28000 (4000*7)
I grepped the log file for the number of documents fetched per cycle
and the number of hung threads:
cycle  fetching  hung_threads
1          3863             0
2         16624             1
3           943            10
4           296            10
5           134            10
6            61            10
7            50            10
The number of documents fetched per cycle converges to zero because the
FetcherThreads are blocked.
With every new cycle, unfetched PDF documents fill up more of the queue.
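Counts like the ones in the table above can be collected with a few lines of Java instead of grep. This is only a sketch: the class name CycleStats, the two regular expressions, and the sample log lines are assumptions made for illustration and have to be adapted to what the fetcher log of your crawl actually contains.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CycleStats {
    // Assumed log line shapes (hypothetical); adjust them to your own log file.
    private static final Pattern FETCHING = Pattern.compile("\\bfetching \\S+");
    private static final Pattern HUNG = Pattern.compile("Aborting with (\\d+) hung threads");

    /** Returns {fetchedCount, hungThreads} for the log lines of one cycle. */
    static int[] summarize(List<String> lines) {
        int fetched = 0;
        int hung = 0;
        for (String line : lines) {
            if (FETCHING.matcher(line).find()) {
                fetched++;
            }
            Matcher m = HUNG.matcher(line);
            if (m.find()) {
                // Keep the last reported number of hung threads.
                hung = Integer.parseInt(m.group(1));
            }
        }
        return new int[] {fetched, hung};
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from a real crawl.
        List<String> sample = Arrays.asList(
            "INFO fetcher.Fetcher - fetching http://example.com/a.pdf",
            "INFO fetcher.Fetcher - fetching http://example.org/b.html",
            "WARN fetcher.Fetcher - Aborting with 10 hung threads.");
        int[] stats = summarize(sample);
        System.out.println(stats[0] + " fetched, " + stats[1] + " hung threads");
    }
}
```

Splitting the log into one list of lines per cycle is left out here, since it depends on how the cycles are marked in the log.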
I now have a solution. I had to patch Nutch's parse-pdf plugin as well as
PDFBox.
My first attempt with the current version of PDFBox didn't change anything:
the code that opens the log file PDFBox.log has not changed.
For PDFBox (trunk from svn), apply this patch:
% svn diff src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java
Index: src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java
===================================================================
--- src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java (Revision 801087)
+++ src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java (working copy)
@@ -35,10 +35,10 @@
//http://www.rgagnon.com/javadetails/java-0501.html
if (logger_ == null){
- FileHandler fh = new FileHandler("PDFBox.log", true);
- fh.setFormatter(new SimpleFormatter());
+ // FileHandler fh = new FileHandler("PDFBox.log", true);
+ // fh.setFormatter(new SimpleFormatter());
logger_ = Logger.getLogger("TestLog");
- logger_.addHandler(fh);
+ // logger_.addHandler(fh);
/*Set the log level here.
The lower your logging level, the more stuff will be logged.
(Of course, commenting out is not a real solution.)
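A slightly cleaner variant than commented-out lines might look like the sketch below. It is not the actual LoggingObject code: the class name PdfBoxLoggerSketch, the method name getLogger(), and the WARNING level are assumptions made for illustration. The idea is only that the logger is created without a FileHandler, so nothing in this code path opens PDFBox.log, and output goes to whatever handlers the surrounding application has configured.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class PdfBoxLoggerSketch {
    private static Logger logger_;

    /** Lazily creates the logger without attaching a FileHandler. */
    static synchronized Logger getLogger() {
        if (logger_ == null) {
            // Same logger name as in the patched code above.
            logger_ = Logger.getLogger("TestLog");
            // Assumed level; the original sets its own level at this point.
            logger_.setLevel(Level.WARNING);
        }
        return logger_;
    }

    public static void main(String[] args) {
        Logger log = getLogger();
        // Goes to the parent handlers (by default the console), not to a file.
        log.warning("parsed without opening PDFBox.log");
        System.out.println("own handlers: " + log.getHandlers().length);
    }
}
```

Whether this is acceptable upstream is a separate question; it only removes the file handler and leaves the choice of log destination to the host application.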
Then run ant in trunk/ to build PDFBox and copy
<PDFBox>/trunk/lib/pdfbox-0.8.0-incubating.jar
<PDFBox>/trunk/external/fontbox-0.8.0-incubating.jar
<PDFBox>/trunk/external/jempbox-0.8.0-incubating.jar
to <Nutch>/src/plugin/parse-pdf/lib/
(Still TODO: renew license files)
Now apply the patches to Nutch 1.0 parse-pdf:
src/plugin/parse-pdf/plugin.xml
--- src/plugin/parse-pdf/plugin.xml~ 2009-03-23 20:04:06.000000000 +0100
+++ src/plugin/parse-pdf/plugin.xml 2009-08-05 11:28:06.000000000 +0200
@@ -26,9 +26,9 @@
<library name="parse-pdf.jar">
<export name="*"/>
</library>
- <library name="PDFBox-0.7.4-dev.jar"/>
- <library name="FontBox-0.2.0-dev.jar"/>
- <library name="JempBox-0.2.0-dev.jar"/>
+ <library name="pdfbox-0.8.0-incubating.jar"/>
+ <library name="fontbox-0.8.0-incubating.jar"/>
+ <library name="jempbox-0.8.0-incubating.jar"/>
<library name="bcprov-jdk14-132.jar"/>
<!-- Uncomment the following two lines after you have downloaded the
libraries, see README.txt for more details.-->
src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java
--- src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java~ 2009-03-23 20:04:01.000000000 +0100
+++ src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java 2009-08-05 11:16:35.000000000 +0200
@@ -17,14 +17,14 @@
package org.apache.nutch.parse.pdf;
-import org.pdfbox.pdfparser.PDFParser;
-import org.pdfbox.pdmodel.PDDocument;
-import org.pdfbox.pdmodel.PDDocumentInformation;
-import org.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
-import org.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
-import org.pdfbox.util.PDFTextStripper;
+import org.apache.pdfbox.pdfparser.PDFParser;
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDDocumentInformation;
+import org.apache.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
+import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
+import org.apache.pdfbox.util.PDFTextStripper;
-import org.pdfbox.exceptions.CryptographyException;
+import org.apache.pdfbox.exceptions.CryptographyException;
// Commons Logging imports
import org.apache.commons.logging.Log;
... and build Nutch. I tested
What about further steps? Is there a maintainer for parse-pdf?
Bye, Sebastian