CrawlDbReducer#reduce has no switch case for CrawlDatum.STATUS_SIGNATURE, so we fall into the default block (line #121), which throws a RuntimeException. As a result, my update db job never succeeds.

This has just recently started happening.

With logging enabled, I see that what usually happens is that a CrawlDatum with a STATUS_SIGNATURE status comes through first and is set as 'highest' (line #49), but then the next record through takes over the 'highest' role because its status is higher, usually 'fetch_success' or 'linked' in my case.

But for reasons not clear to me, I'll sometimes have a lone CrawlDatum with a status of STATUS_SIGNATURE (A mapout lost a record?) with no following 'fetch_success' or 'linked' CrawlDatum.
This probably shouldn't fail the job.
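
To make the failure mode concrete, here is a minimal standalone sketch of the selection described above. It is not the actual Nutch code: the class name, the helper methods, and the numeric constant values are all made up for illustration, and the only property assumed from the report is that STATUS_SIGNATURE ranks below 'linked' and 'fetch_success'.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of how a reducer might pick the 'highest' status for one key. */
class LoneSignatureSketch {
  // Illustrative values only; the real constants are defined in CrawlDatum.
  static final int STATUS_SIGNATURE = 0;
  static final int STATUS_LINKED = 4;
  static final int STATUS_FETCH_SUCCESS = 5;

  /** Returns the largest status among the values seen for one key. */
  static int pickHighest(List<Integer> statuses) {
    int highest = Integer.MIN_VALUE;
    for (int s : statuses) {
      if (s > highest) {
        highest = s;
      }
    }
    return highest;
  }

  /** Mirrors the shape of the switch, but returns a description instead of throwing. */
  static String handle(int highest) {
    switch (highest) {
      case STATUS_FETCH_SUCCESS:
        return "fetch_success";
      case STATUS_LINKED:
        return "linked";
      default:
        // This is the branch where the real reducer throws a RuntimeException.
        return "unknown status: " + highest;
    }
  }

  public static void main(String[] args) {
    // Usual case: the signature record is outranked by the record that follows it.
    System.out.println(handle(pickHighest(Arrays.asList(STATUS_SIGNATURE, STATUS_FETCH_SUCCESS))));
    // Lone signature: nothing outranks it, so it reaches the default branch.
    System.out.println(handle(pickHighest(Arrays.asList(STATUS_SIGNATURE))));
  }
}
```

Running `main` prints `fetch_success` for the usual case and `unknown status: 0` for the lone-signature case, which is the situation that currently kills the job.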

Attached is a patch that logs a warning and keeps going, but it's probably not the right solution.

Thanks,
St.Ack


Index: src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 397664)
+++ src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -19,11 +19,16 @@
 import java.util.Iterator;
 import java.io.IOException;
 
+import java.util.logging.*;
+
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapred.*;
+import org.apache.hadoop.util.LogFormatter;
 
 /** Merge new page entries with existing entries. */
 public class CrawlDbReducer implements Reducer {
+  public static final Logger LOG =
+    LogFormatter.getLogger("org.apache.nutch.crawl.CrawlDbReducer");
   private int retryMax;
   private CrawlDatum result = new CrawlDatum();
 
@@ -102,6 +107,8 @@
       result.setNextFetchTime();
       break;
 
+    case CrawlDatum.STATUS_SIGNATURE:
+      LOG.warning("Lone CrawlDatum.STATUS_SIGNATURE: " + key);  // no break: falls through to the retry case
     case CrawlDatum.STATUS_FETCH_RETRY:           // temporary failure
       if (old != null)
         result.setSignature(old.getSignature());  // use old signature
@@ -119,7 +126,7 @@
       break;
 
     default:
-      throw new RuntimeException("Unknown status: "+highest.getStatus());
+      throw new RuntimeException("Unknown status: "+highest.getStatus() + " " + key);
     }
     
     result.setScore(result.getScore() + scoreIncrement);
