Just to pinpoint how serious it was for my crawl:

The crawl was started with 4000 seed URLs, all from different hosts,
and the following options and properties:
 threads = 10
 depth = 7
 generate.max.per.host = 7
 topN = 28000 (4000*7)

I grepped the log file for how many documents were fetched per cycle
and how many threads were hung:

cycle  fetched  hung_threads
1         3863             0
2        16624             1
3          943            10
4          296            10
5          134            10
6           61            10
7           50            10

The number of documents fetched per cycle converges to zero because of the
blocked FetcherThreads.
With every new cycle, more unfetched PDF documents pile up in the queue.
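
As a rough cross-check of the figures in the table, here is a small Java
sketch that totals the fetched documents and relates each cycle to topN.
The class and method names are made up for illustration; only the numbers
come from the crawl above.

```java
import java.util.Locale;

// Illustrative helper, not part of Nutch: summarizes the per-cycle numbers
// copied from the table in this mail.
class CrawlCycleStats {
    static final int TOP_N = 28000; // 4000 seeds * generate.max.per.host (7)
    static final int[] FETCHED = {3863, 16624, 943, 296, 134, 61, 50};

    // Sum of all documents fetched over the seven cycles.
    static int totalFetched() {
        int sum = 0;
        for (int n : FETCHED) {
            sum += n;
        }
        return sum;
    }

    // Share of the topN budget that one cycle actually used, in percent.
    static double percentOfTopN(int fetched) {
        return 100.0 * fetched / TOP_N;
    }

    // One line per cycle, followed by the overall total.
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < FETCHED.length; i++) {
            sb.append(String.format(Locale.ROOT,
                    "cycle %d: %5d fetched (%.2f%% of topN)%n",
                    i + 1, FETCHED[i], percentOfTopN(FETCHED[i])));
        }
        sb.append("total fetched: ").append(totalFetched()).append('\n');
        return sb.toString();
    }
}
```

Calling System.out.print(CrawlCycleStats.report()) prints the summary; the
total over all seven cycles is 21971 documents.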


I now have a solution. I had to patch Nutch's parse-pdf plugin as well as 
PDFBox.
A first attempt with only the current version of PDFBox didn't change anything:
the code that opens the log file PDFBox.log has not changed.

For PDFBox (trunk from svn) apply the patch:

% svn diff src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java
Index: src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java
===================================================================
--- src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java       (revision 801087)
+++ src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java       (working copy)
@@ -35,10 +35,10 @@

        //http://www.rgagnon.com/javadetails/java-0501.html
        if (logger_ == null){
-               FileHandler fh = new FileHandler("PDFBox.log", true);
-               fh.setFormatter(new SimpleFormatter());
+               // FileHandler fh = new FileHandler("PDFBox.log", true);
+               // fh.setFormatter(new SimpleFormatter());
                logger_ = Logger.getLogger("TestLog");
-               logger_.addHandler(fh);
+               // logger_.addHandler(fh);

             /*Set the log level here.
             The lower your logging level, the more stuff will be logged.

(Of course, commenting out is not a real solution.)
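
One possible direction for a cleaner fix, sketched here only as an idea:
attach the FileHandler only when a log file is explicitly requested.
The property name "pdfbox.logfile" and the class name are hypothetical and
not part of PDFBox; the logger name "TestLog" is taken from the patch above.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Sketch under assumptions: this is NOT the PDFBox LoggingObject class.
class OptionalFileLogging {
    // Hypothetical system property; PDFBox does not define it.
    static final String LOG_FILE_PROPERTY = "pdfbox.logfile";

    private static Logger logger;

    // Returns the shared logger. A FileHandler is added only if the
    // property names a file, so by default no log file is opened.
    static synchronized Logger getLogger() {
        if (logger == null) {
            Logger created = Logger.getLogger("TestLog");
            String path = System.getProperty(LOG_FILE_PROPERTY);
            if (path != null && !path.isEmpty()) {
                try {
                    FileHandler fh = new FileHandler(path, true); // append mode
                    fh.setFormatter(new SimpleFormatter());
                    created.addHandler(fh);
                } catch (IOException e) {
                    // Fall back to logging without a file.
                    created.warning("Cannot open log file " + path + ": " + e);
                }
            }
            logger = created;
        }
        return logger;
    }
}
```

With the property unset, nothing touches PDFBox.log; starting the JVM with
-Dpdfbox.logfile=PDFBox.log would restore the file logging.
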
Then run ant in trunk/ to build PDFBox and copy
 <PDFBox>/trunk/lib/pdfbox-0.8.0-incubating.jar
 <PDFBox>/trunk/external/fontbox-0.8.0-incubating.jar
 <PDFBox>/trunk/external/jempbox-0.8.0-incubating.jar
to <Nutch>/src/plugin/parse-pdf/lib/
(Still TODO: update the license files)



Now apply the patches to Nutch 1.0 parse-pdf:

src/plugin/parse-pdf/plugin.xml

--- src/plugin/parse-pdf/plugin.xml~ 2009-03-23 20:04:06.000000000 +0100
+++ src/plugin/parse-pdf/plugin.xml  2009-08-05 11:28:06.000000000 +0200
@@ -26,9 +26,9 @@
       <library name="parse-pdf.jar">
          <export name="*"/>
       </library>
-      <library name="PDFBox-0.7.4-dev.jar"/>
-      <library name="FontBox-0.2.0-dev.jar"/>
-      <library name="JempBox-0.2.0-dev.jar"/>
+      <library name="pdfbox-0.8.0-incubating.jar"/>
+      <library name="fontbox-0.8.0-incubating.jar"/>
+      <library name="jempbox-0.8.0-incubating.jar"/>
       <library name="bcprov-jdk14-132.jar"/>
       <!-- Uncomment the following two lines after you have downloaded the
            libraries, see README.txt for more details.-->



src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java

--- src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java~ 2009-03-23 20:04:01.000000000 +0100
+++ src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java  2009-08-05 11:16:35.000000000 +0200
@@ -17,14 +17,14 @@

 package org.apache.nutch.parse.pdf;

-import org.pdfbox.pdfparser.PDFParser;
-import org.pdfbox.pdmodel.PDDocument;
-import org.pdfbox.pdmodel.PDDocumentInformation;
-import org.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
-import org.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
-import org.pdfbox.util.PDFTextStripper;
+import org.apache.pdfbox.pdfparser.PDFParser;
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDDocumentInformation;
+import org.apache.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
+import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
+import org.apache.pdfbox.util.PDFTextStripper;

-import org.pdfbox.exceptions.CryptographyException;
+import org.apache.pdfbox.exceptions.CryptographyException;

 // Commons Logging imports
 import org.apache.commons.logging.Log;


... and build Nutch. I tested

What about further steps? Is there a maintainer for parse-pdf?

Bye, Sebastian
