(redirected to nutch-dev)

[EMAIL PROTECTED] wrote:
CrawlDbReducer#reduce has no case for CrawlDatum.STATUS_SIGNATURE in its switch, so we fall into the default block (line #121), which throws a RuntimeException. As a result my updatedb job never succeeds.

This has just recently started happening.

With logging enabled, I see that what usually happens is this: a CrawlDatum with a STATUS_SIGNATURE status comes through first and is set as 'highest' (line #49), but then the next record takes over the 'highest' role because its status is higher, usually 'fetch_success' or 'linked' in my case.

But for reasons not clear to me, I'll sometimes get a lone CrawlDatum with a status of STATUS_SIGNATURE (did a map output lose a record?) and no following 'fetch_success' or 'linked' CrawlDatum. This probably shouldn't fail the job.

Attached is a patch that logs a warning and keeps going, but it is probably not the right solution.
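
For readers following along, the situation described above can be sketched roughly as follows. This is a simplified stand-in and not the actual CrawlDbReducer code: the class name, the helper methods, and the numeric status values are all assumptions made for illustration. It only shows how picking the 'highest' status per key leaves a signature record on top when nothing else arrives for that key.

```java
import java.util.Arrays;
import java.util.List;

public class LoneSignatureSketch {
    // Illustrative status codes only; the numeric values are assumed here
    // and are not taken from CrawlDatum.
    static final int STATUS_SIGNATURE = 0;
    static final int STATUS_LINKED = 4;
    static final int STATUS_FETCH_SUCCESS = 5;

    /** Returns the highest status among the values grouped under one key. */
    static int highestStatus(List<Integer> statuses) {
        int highest = Integer.MIN_VALUE;
        for (int s : statuses) {
            if (s > highest) {
                highest = s;
            }
        }
        return highest;
    }

    /** True when no record with a higher status arrived for the key. */
    static boolean isLoneSignature(List<Integer> statuses) {
        return !statuses.isEmpty()
            && highestStatus(statuses) == STATUS_SIGNATURE;
    }

    public static void main(String[] args) {
        // Usual case: the signature record is overtaken by fetch_success.
        List<Integer> usual = Arrays.asList(STATUS_SIGNATURE, STATUS_FETCH_SUCCESS);
        // Problem case: the signature record arrives on its own.
        List<Integer> lone = Arrays.asList(STATUS_SIGNATURE);

        System.out.println("usual is lone signature: " + isLoneSignature(usual));
        System.out.println("lone is lone signature: " + isLoneSignature(lone));
    }
}
```

In the problem case the switch on the highest status sees the signature value, which is why a dedicated case (or a default that does not throw) is needed to keep the job alive.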

How weird. This Should Never Happen(tm) ... ;) Lost map output should show up in the logs, or perhaps should even have killed the job, shouldn't it? I'll apply your patch for now, but we need to keep an eye on this.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Index: src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 397664)
+++ src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -19,11 +19,16 @@
 import java.util.Iterator;
 import java.io.IOException;
 
+import java.util.logging.*;
+
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapred.*;
+import org.apache.hadoop.util.LogFormatter;
 
 /** Merge new page entries with existing entries. */
 public class CrawlDbReducer implements Reducer {
+  public static final Logger LOG =
+    LogFormatter.getLogger("org.apache.nutch.crawl.CrawlDbReducer");
   private int retryMax;
   private CrawlDatum result = new CrawlDatum();
 
@@ -102,6 +107,8 @@
       result.setNextFetchTime();
       break;
 
+    case CrawlDatum.STATUS_SIGNATURE:
+      LOG.warning("Lone CrawlDatum.STATUS_SIGNATURE: " + key);      
     case CrawlDatum.STATUS_FETCH_RETRY:           // temporary failure
       if (old != null)
         result.setSignature(old.getSignature());  // use old signature
@@ -119,7 +126,7 @@
       break;
 
     default:
-      throw new RuntimeException("Unknown status: "+highest.getStatus());
+      throw new RuntimeException("Unknown status: "+highest.getStatus() + " " + key);
     }
     
     result.setScore(result.getScore() + scoreIncrement);
