CrawlDbReducer#reduce doesn't have a switch case for
CrawlDatum.STATUS_SIGNATURE, so we fall into the default block
(line #121), which throws a RuntimeException. This causes my update db
job to never succeed.
This has just recently started happening.
With logging enabled, I see that what usually happens is that a CrawlDatum
with a STATUS_SIGNATURE status comes through first and is set to be
'highest' (line #49), but then the next record through takes over the
'highest' role because its status is higher, usually 'fetch_success' or
'linked' in my case.
But for reasons not clear to me, I'll sometimes have a lone CrawlDatum
with a status of STATUS_SIGNATURE (did a map output lose a record?) with no
following 'fetch_success' or 'linked' CrawlDatum.
This probably shouldn't fail the job.
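To make the failure mode concrete, here is a minimal standalone sketch of the 'highest' selection as I understand it. It is not the actual CrawlDbReducer code: the class name, the helper methods and the numeric status values are assumptions made up for illustration, and only the constant names come from CrawlDatum.

```java
import java.util.Arrays;
import java.util.List;

public class HighestStatusSketch {
    // Illustrative status codes. The numeric values are assumptions for
    // this sketch and are not taken from CrawlDatum.
    static final int STATUS_SIGNATURE = 0;
    static final int STATUS_LINKED = 4;
    static final int STATUS_FETCH_SUCCESS = 5;

    /** Returns the largest status among the records seen for one key, or -1 if none. */
    static int highest(List<Integer> statuses) {
        int highest = -1;
        for (int status : statuses) {
            if (status > highest) {
                highest = status;
            }
        }
        return highest;
    }

    /** True when signature records are the only records seen for the key. */
    static boolean isLoneSignature(List<Integer> statuses) {
        return !statuses.isEmpty() && highest(statuses) == STATUS_SIGNATURE;
    }

    public static void main(String[] args) {
        // Usual case: a later record with a higher status takes over 'highest'.
        System.out.println(isLoneSignature(
            Arrays.asList(STATUS_SIGNATURE, STATUS_FETCH_SUCCESS)));
        // Failing case: nothing follows the signature record, so the
        // switch would be entered with STATUS_SIGNATURE as 'highest'.
        System.out.println(isLoneSignature(Arrays.asList(STATUS_SIGNATURE)));
    }
}
```

In the second case the real reducer has no matching case and ends up in the default block.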
Attached is a patch that logs a warning and keeps going (it falls through
to the STATUS_FETCH_RETRY handling), but it's probably not the right solution.
Thanks,
St.Ack
Index: src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 397664)
+++ src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -19,11 +19,16 @@
import java.util.Iterator;
import java.io.IOException;
+import java.util.logging.*;
+
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
+import org.apache.hadoop.util.LogFormatter;
/** Merge new page entries with existing entries. */
public class CrawlDbReducer implements Reducer {
+ public static final Logger LOG =
+ LogFormatter.getLogger("org.apache.nutch.crawl.CrawlDbReducer");
private int retryMax;
private CrawlDatum result = new CrawlDatum();
@@ -102,6 +107,8 @@
result.setNextFetchTime();
break;
+ case CrawlDatum.STATUS_SIGNATURE:
+ LOG.warning("Lone CrawlDatum.STATUS_SIGNATURE: " + key);
case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
if (old != null)
result.setSignature(old.getSignature()); // use old signature
@@ -119,7 +126,7 @@
break;
default:
- throw new RuntimeException("Unknown status: "+highest.getStatus());
+ throw new RuntimeException("Unknown status: "+highest.getStatus() + " " + key);
}
result.setScore(result.getScore() + scoreIncrement);