[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]

Stefan Groschupf updated NUTCH-348:
-----------------------------------

    Attachment: sortPatchV1.patch

What do people think about this kind of solution?

> Generator is building fetch list using *lowest* scoring URLs
> ------------------------------------------------------------
>
>                 Key: NUTCH-348
>                 URL: http://issues.apache.org/jira/browse/NUTCH-348
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Chris Schneider
>         Attachments: sortPatchV1.patch
>
>
> Ever since revision 391271, when the CrawlDatum key was replaced by a 
> FloatWritable key, the Generator.Selector.reduce method has been outputting 
> the *lowest* scoring URLs! The CrawlDatum class has a Comparator that 
> essentially treats higher scoring CrawlDatum objects as "less than" lower 
> scoring CrawlDatum objects, so the higher scoring ones would appear first in 
> a sequence file sorted using this as the key.
> When a FloatWritable based on the score itself (as returned from 
> scfilters.generatorSortValue) became the sort key, it should have been 
> negated in Generator.Selector.map to have the same result. Curiously, there 
> is a comment to this effect immediately before the FloatWritable is set:
>       // sort by decreasing score
>       sortValue.set(sort);
> It seems like the simplest way to fix this is to just negate the score, and 
> this seems to work for me:
>       // sort by decreasing score
>       // 2006-08-15 CSc REALLY sort by decreasing score
>       sortValue.set(-sort);
> Unfortunately, this means that any crawls that have been done using 
> Generator.java after revision 391271 should be discarded, as they were 
> focused on fetching the lowest scoring unfetched URLs in the crawldb, 
> essentially pointing the crawler 180 degrees from its intended direction.
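
To illustrate the sign problem described in the report, here is a minimal, self-contained sketch. It does not use the Nutch or Hadoop classes. It assumes only that the sort key is ordered ascending by its float value, as the report implies, and that the first N records in sorted order are the ones selected. The class name SortKeyDemo and the methods key and topN are hypothetical, chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SortKeyDemo {
    /** Sort key for a score: the raw score, or its negation (hypothetical helper). */
    static float key(float score, boolean negate) {
        return negate ? -score : score;
    }

    /**
     * Sorts the scores ascending by key and returns the first n,
     * mimicking a selection step that takes records in sorted-key order.
     */
    static List<Float> topN(List<Float> scores, int n, boolean negate) {
        List<Float> sorted = new ArrayList<>(scores);
        sorted.sort((a, b) -> Float.compare(key(a, negate), key(b, negate)));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<Float> scores = List.of(0.2f, 3.5f, 1.0f, 0.05f);
        // Raw score as key: the lowest scoring entries come first.
        System.out.println(topN(scores, 2, false)); // [0.05, 0.2]
        // Negated score as key: the highest scoring entries come first.
        System.out.println(topN(scores, 2, true));  // [3.5, 1.0]
    }
}
```

With an ascending key order, negating the score is what makes the highest scoring entries sort to the front, which matches the one-line change proposed in the description.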
