Generator is building fetch list using *lowest* scoring URLs
------------------------------------------------------------
Key: NUTCH-348
URL: http://issues.apache.org/jira/browse/NUTCH-348
Project: Nutch
Issue Type: Bug
Components: fetcher
Reporter: Chris Schneider
Ever since revision 391271, when the CrawlDatum key was replaced by a
FloatWritable key, the Generator.Selector.reduce method has been outputting the
*lowest* scoring URLs! The CrawlDatum class has a Comparator that essentially
treats higher scoring CrawlDatum objects as "less than" lower scoring
CrawlDatum objects, so the higher scoring ones would appear first in a sequence
file sorted using CrawlDatum as the key.
When a FloatWritable based on the score itself (as returned from
scfilters.generatorSortValue) became the sort key, it should have been negated
in Generator.Selector.map to preserve that ordering, because FloatWritable
keys sort in ascending numeric order. Curiously, there is a comment to this
effect immediately before the FloatWritable is set:
    // sort by decreasing score
    sortValue.set(sort);
It seems like the simplest way to fix this is to just negate the score, and
this seems to work for me:
    // sort by decreasing score
    // 2006-08-15 CSc REALLY sort by decreasing score
    sortValue.set(-sort);
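The effect of the negation can be checked in isolation. The following is only a minimal sketch, not Nutch code: it assumes plain Java floats as a stand-in for FloatWritable keys (both compare in ascending numeric order), and the class name SortKeyDemo and the method select are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SortKeyDemo {
    /**
     * Hypothetical helper: builds a sort key per score (negated or not),
     * sorts the keys ascending as a float-keyed sort would, and returns
     * the scores belonging to the first `limit` keys.
     */
    static List<Float> select(float[] scores, boolean negateKey, int limit) {
        List<Float> keys = new ArrayList<>();
        for (float s : scores) {
            keys.add(negateKey ? -s : s);
        }
        // Ascending numeric order, the assumed behavior of the sort key.
        keys.sort(Float::compare);

        List<Float> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, keys.size()); i++) {
            float k = keys.get(i);
            // Undo the negation to recover the original score.
            selected.add(negateKey ? -k : k);
        }
        return selected;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 2.0f, 1.0f, 0.25f};
        System.out.println(select(scores, false, 2)); // [0.25, 0.5]  lowest first
        System.out.println(select(scores, true, 2));  // [2.0, 1.0]   highest first
    }
}
```

With the raw score as the key, the first entries emitted are the lowest scoring ones; with the negated score they are the highest scoring ones, which is what the "sort by decreasing score" comment intends.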
Unfortunately, this means that any crawls that have been done using
Generator.java from revision 391271 onward should be discarded, as they were
focused on fetching the lowest scoring unfetched URLs in the crawldb,
essentially pointing the crawler 180 degrees from its intended direction.