[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]
Stefan Groschupf updated NUTCH-348:
-----------------------------------
Attachment: sortPatchV1.patch
What do people think about this kind of solution?
> Generator is building fetch list using *lowest* scoring URLs
> ------------------------------------------------------------
>
> Key: NUTCH-348
> URL: http://issues.apache.org/jira/browse/NUTCH-348
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Reporter: Chris Schneider
> Attachments: sortPatchV1.patch
>
>
> Ever since revision 391271, when the CrawlDatum key was replaced by a
> FloatWritable key, the Generator.Selector.reduce method has been outputting
> the *lowest* scoring URLs! The CrawlDatum class has a Comparator that
> essentially treats higher scoring CrawlDatum objects as "less than" lower
> scoring CrawlDatum objects, so the higher scoring ones would appear first in
> a sequence file sorted using this as the key.
> When a FloatWritable based on the score itself (as returned from
> scfilters.generatorSortValue) became the sort key, it should have been
> negated in Generator.Selector.map to have the same result. Curiously, there
> is a comment to this effect immediately before the FloatWritable is set:
> // sort by decreasing score
> sortValue.set(sort);
> It seems like the simplest way to fix this is to just negate the score, and
> this seems to work for me:
> // sort by decreasing score
> // 2006-08-15 CSc REALLY sort by decreasing score
> sortValue.set(-sort);
> Unfortunately, this means that any crawls that have been done using
> Generator.java after revision 391271 should be discarded, as they were
> focused on fetching the lowest scoring unfetched URLs in the crawldb,
> essentially pointing the crawler 180 degrees from its intended direction.
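
To make the sign issue in the description concrete, here is a minimal sketch in plain Java. It does not use the Nutch or Hadoop classes: the class SortKeyDemo, the method topN, and the example URLs and scores are all hypothetical, and a TreeMap stands in for an ascending sort on a float key. It only assumes, as the report states, that the sort key is ordered ascending and that the first entries are the ones selected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortKeyDemo {

    // Hypothetical stand-in for "sort by a float key, then take the first n".
    // TreeMap iterates its Float keys in ascending order.
    static List<String> topN(Map<String, Float> scores, boolean negate, int n) {
        TreeMap<Float, List<String>> sorted = new TreeMap<>();
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            // Negating the score turns the ascending key order into
            // a descending order on the original score.
            float key = negate ? -e.getValue() : e.getValue();
            sorted.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        List<String> out = new ArrayList<>();
        for (List<String> urls : sorted.values()) {
            for (String url : urls) {
                if (out.size() == n) {
                    return out;
                }
                out.add(url);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, Float> scores = Map.of(
                "http://a.example/", 0.1f,
                "http://b.example/", 2.5f,
                "http://c.example/", 1.0f);

        // Raw score as key: the lowest-scoring URLs come first.
        System.out.println(topN(scores, false, 2)); // [http://a.example/, http://c.example/]

        // Negated score as key: the highest-scoring URLs come first.
        System.out.println(topN(scores, true, 2));  // [http://b.example/, http://c.example/]
    }
}
```

With the raw score as the key, the two selected URLs are the two lowest scoring ones, which matches the behavior described in the report; with the negated key the selection flips to the highest scoring ones.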