Generator is building fetch list using *lowest* scoring URLs
------------------------------------------------------------

                 Key: NUTCH-348
                 URL: http://issues.apache.org/jira/browse/NUTCH-348
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
            Reporter: Chris Schneider


Ever since revision 391271, when the CrawlDatum key was replaced by a 
FloatWritable key, the Generator.Selector.reduce method has been outputting the 
*lowest* scoring URLs! The CrawlDatum class has a Comparator that essentially 
treats higher scoring CrawlDatum objects as "less than" lower scoring 
CrawlDatum objects, so the higher scoring ones would appear first in a sequence 
file sorted using this as the key.
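
To illustrate the ordering described above, here is a minimal sketch of a comparator that treats a higher score as "less than" a lower one. The Entry class and the names DescendingScoreDemo, BY_DECREASING_SCORE and sortedUrls are hypothetical stand-ins for illustration. This is not the actual CrawlDatum code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DescendingScoreDemo {
    // Hypothetical stand-in for a scored crawl entry; NOT the real CrawlDatum.
    static final class Entry {
        final String url;
        final float score;

        Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Arguments are swapped on purpose: a higher score compares as "less than"
    // a lower score, so higher scoring entries come first after sorting.
    static final Comparator<Entry> BY_DECREASING_SCORE =
        (a, b) -> Float.compare(b.score, a.score);

    // Returns the URLs ordered from highest score to lowest.
    static List<String> sortedUrls(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(BY_DECREASING_SCORE);
        List<String> urls = new ArrayList<>();
        for (Entry e : copy) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("http://low.example/", 0.1f));
        entries.add(new Entry("http://high.example/", 2.5f));
        entries.add(new Entry("http://mid.example/", 1.0f));
        // The highest scoring URL is listed first.
        System.out.println(sortedUrls(entries));
    }
}
```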

When a FloatWritable based on the score itself (as returned from 
scfilters.generatorSortValue) became the sort key, it should have been negated 
in Generator.Selector.map to get the same result, because FloatWritable keys 
sort in ascending order. Curiously, there is a comment to this effect 
immediately before the FloatWritable is set:

      // sort by decreasing score
      sortValue.set(sort);

It seems like the simplest way to fix this is to just negate the score, and 
this seems to work for me:

      // sort by decreasing score
      // 2006-08-15 CSc REALLY sort by decreasing score
      sortValue.set(-sort);
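
For anyone who wants to see the effect in isolation, here is a rough sketch that needs no Hadoop at all. It assumes only that float keys are sorted in ascending order and that the consumer takes the first N entries. A TreeMap stands in for the sorted map output, and the names SortKeyDemo and firstN are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SortKeyDemo {
    // Returns the first `limit` URLs in ascending key order, which is what a
    // consumer of ascending-sorted float keys would see first.
    // Assumes distinct scores, since a TreeMap keeps one value per key.
    static List<String> firstN(float[] scores, String[] urls, boolean negate, int limit) {
        TreeMap<Float, String> sorted = new TreeMap<>(); // ascending key order
        for (int i = 0; i < scores.length; i++) {
            // Negating the score flips the order: the largest score becomes
            // the smallest key and therefore sorts first.
            sorted.put(negate ? -scores[i] : scores[i], urls[i]);
        }
        List<String> out = new ArrayList<>();
        for (String url : sorted.values()) {
            if (out.size() == limit) {
                break;
            }
            out.add(url);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 2.5f, 1.0f};
        String[] urls = {"http://a.example/", "http://b.example/", "http://c.example/"};
        // Raw score as key: the two LOWEST scoring URLs are selected.
        System.out.println(firstN(scores, urls, false, 2));
        // Negated score as key: the two HIGHEST scoring URLs are selected.
        System.out.println(firstN(scores, urls, true, 2));
    }
}
```

With the raw score as the key, the sketch picks a.example and c.example (scores 0.1 and 1.0). With the negated key it picks b.example and c.example (scores 2.5 and 1.0).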

Unfortunately, this means that any crawls that have been done using 
Generator.java after revision 391271 should be discarded, as they were focused 
on fetching the lowest scoring unfetched URLs in the crawldb, essentially 
pointing the crawler 180 degrees from its intended direction.



        
