Hi,

I believe there is a bug in 'nutch inject'. When the db contains a url with status DB_fetched and the same url is injected again, the status is (sometimes) reset to DB_unfetched. I believe it depends on the order in which the entries for that url arrive in the reduce set.

If this is indeed a bug and can be confirmed by someone else, then the patch below should fix it.

Please comment / advise.

Thanks!
Jochen


Patch:

Index: C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 398634)
+++ C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -58,7 +58,9 @@
      case CrawlDatum.STATUS_DB_UNFETCHED:
      case CrawlDatum.STATUS_DB_FETCHED:
      case CrawlDatum.STATUS_DB_GONE:
-        old = datum;
+          if(old == null || (old.getStatus() < datum.getStatus())) {
+                  old = datum;
+          }
        break;
      case CrawlDatum.STATUS_LINKED:
        scoreIncrement += datum.getScore();
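
To illustrate the order dependence, here is a minimal standalone sketch. It does not use the Nutch classes. The class name, the method names and the numeric status values are assumptions made up for this example; the only property relied on is that the unfetched status compares lower than the fetched one, which is what the comparison in the patch assumes.

```java
import java.util.Arrays;
import java.util.List;

public class InjectOrderSketch {
    // Assumed values for illustration only; the sketch needs UNFETCHED < FETCHED < GONE.
    static final int DB_UNFETCHED = 1;
    static final int DB_FETCHED = 2;
    static final int DB_GONE = 3;

    // Mimics "old = datum": whichever db status arrives last wins.
    // Returns -1 if no status was seen.
    static int lastWins(List<Integer> statuses) {
        int old = -1;
        for (int status : statuses) {
            old = status;
        }
        return old;
    }

    // Mimics the patched code: keep the entry with the highest status,
    // regardless of arrival order. Returns -1 if no status was seen.
    static int highestWins(List<Integer> statuses) {
        int old = -1;
        for (int status : statuses) {
            if (old == -1 || old < status) {
                old = status;
            }
        }
        return old;
    }

    public static void main(String[] args) {
        // Existing fetched entry arrives first, injected unfetched entry second.
        List<Integer> fetchedFirst = Arrays.asList(DB_FETCHED, DB_UNFETCHED);
        // Same two entries in the opposite order.
        List<Integer> injectedFirst = Arrays.asList(DB_UNFETCHED, DB_FETCHED);

        System.out.println("last wins, fetched first:     " + lastWins(fetchedFirst));
        System.out.println("last wins, injected first:    " + lastWins(injectedFirst));
        System.out.println("highest wins, fetched first:  " + highestWins(fetchedFirst));
        System.out.println("highest wins, injected first: " + highestWins(injectedFirst));
    }
}
```

With the "last wins" logic the result changes with the arrival order, which would match the "sometimes" in the report. With the comparison the result is the same for both orders. If the constants are ordered as assumed here, the same comparison would also let DB_gone take precedence over DB_fetched.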




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
