Hi,
I believe there is a bug in 'nutch inject'. When the db contains a url
with status DB_fetched and the same url is injected again, the status is
(sometimes) reset to DB_unfetched. I believe it depends on the order in
which the entries for that url make it into the reduce-set:
CrawlDbReducer assigns old = datum unconditionally, so whichever entry
arrives last wins.
If this is indeed a bug and can be confirmed by someone else, then the
patch below should fix it.
Please comment / advise.
Thanks!
Jochen
Patch:
Index: C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 398634)
+++ C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -58,7 +58,9 @@
case CrawlDatum.STATUS_DB_UNFETCHED:
case CrawlDatum.STATUS_DB_FETCHED:
case CrawlDatum.STATUS_DB_GONE:
- old = datum;
+ if(old == null || (old.getStatus() < datum.getStatus())) {
+ old = datum;
+ }
break;
case CrawlDatum.STATUS_LINKED:
scoreIncrement += datum.getScore();
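
To illustrate the order dependence, here is a minimal standalone sketch. It is not Nutch code: the class InjectOrderSketch, the two reduce methods, and the numeric status values are assumptions made for illustration (the patch only relies on DB_unfetched comparing lower than DB_fetched). Each list stands for the db-status entries the reducer sees for one url, in arrival order.

```java
import java.util.Arrays;
import java.util.List;

public class InjectOrderSketch {
    // Assumed values for illustration only; check CrawlDatum for the real constants.
    static final int STATUS_DB_UNFETCHED = 1;
    static final int STATUS_DB_FETCHED = 2;

    // Behaviour before the patch: the last db-status entry seen wins.
    static int reduceUnpatched(List<Integer> statuses) {
        Integer old = null;
        for (int status : statuses) {
            old = status;
        }
        return old == null ? -1 : old;
    }

    // Behaviour with the patch: keep the entry with the highest status.
    static int reducePatched(List<Integer> statuses) {
        Integer old = null;
        for (int status : statuses) {
            if (old == null || old < status) {
                old = status;
            }
        }
        return old == null ? -1 : old;
    }

    public static void main(String[] args) {
        // Existing fetched entry first, injected unfetched entry last.
        List<Integer> injectedLast = Arrays.asList(STATUS_DB_FETCHED, STATUS_DB_UNFETCHED);
        // Same two entries in the opposite arrival order.
        List<Integer> injectedFirst = Arrays.asList(STATUS_DB_UNFETCHED, STATUS_DB_FETCHED);

        System.out.println("unpatched, injected last:  " + reduceUnpatched(injectedLast));
        System.out.println("unpatched, injected first: " + reduceUnpatched(injectedFirst));
        System.out.println("patched, injected last:    " + reducePatched(injectedLast));
        System.out.println("patched, injected first:   " + reducePatched(injectedFirst));
    }
}
```

In this sketch the unpatched loop returns the unfetched status only when the injected entry arrives last, which would match the "sometimes" in the report; the patched loop returns the fetched status for both orders.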