[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

Sebastian Nagel (JIRA) Wed, 29 Nov 2017 09:34:49 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271175#comment-16271175 ]


Sebastian Nagel commented on NUTCH-2455:
----------------------------------------

Correct: the sort value is only used to select the top-N URLs per host. If the 
key is only the host name, then the URLs are passed to the reducer in arbitrary 
order, which makes it impossible to select the top-N URLs by their score. If 
you use a pair <host, score> as key, you get both: all URLs of a single host 
are passed to the same call of the reduce function, and in the right order 
(higher-scoring URLs first).
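
To illustrate the ordering argument, here is a minimal plain-Java sketch. It 
uses no Hadoop classes and is not the Generator code; the class, field and 
method names are hypothetical. It only simulates what a sort comparator on 
<host, score> keys would give the reducer: entries grouped by host, higher 
scores first, so taking the top-N per host is a simple prefix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulation only: all names are hypothetical, nothing here is Nutch API.
class SecondarySortSketch {

  /** A URL together with the composite key parts (host, score). */
  static final class Entry {
    final String host;
    final float score;
    final String url;

    Entry(String host, float score, String url) {
      this.host = host;
      this.score = score;
      this.url = url;
    }
  }

  /** Sort comparator: host ascending, then score descending. */
  static final Comparator<Entry> SORT = (a, b) -> {
    int c = a.host.compareTo(b.host);
    return c != 0 ? c : Float.compare(b.score, a.score); // higher score first
  };

  /** Keeps the first topN URLs per host after sorting by (host, score). */
  static Map<String, List<String>> selectTopN(List<Entry> entries, int topN) {
    List<Entry> sorted = new ArrayList<>(entries);
    sorted.sort(SORT); // stands in for the shuffle/sort phase

    Map<String, List<String>> selected = new LinkedHashMap<>();
    for (Entry e : sorted) {
      // Entries of one host are adjacent, as in a single reduce call.
      List<String> urls = selected.computeIfAbsent(e.host, h -> new ArrayList<>());
      if (urls.size() < topN) {
        urls.add(e.url);
      }
    }
    return selected;
  }
}
```

In a real MapReduce job the grouping by host would additionally need a 
partitioner and a grouping comparator that look at the host part of the key 
only; the sketch sidesteps that by walking one sorted list.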

> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> "The correct solution would be to use <host,score> pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement..."
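
The ordering described in the quote (HostDb entry ahead of all CrawlDb 
entries of the same host) could be sketched as follows. This is again plain 
Java with hypothetical names, not code from the attached patch: the 
comparator sorts by host, then puts the HostDb record first, then orders the 
remaining records by descending score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simulation only: all names are hypothetical, nothing here is Nutch API.
class HostDbFirstSketch {

  /** One value seen by the reducer: either a HostDb or a CrawlDb record. */
  static final class Rec {
    final String host;
    final boolean fromHostDb;
    final float score;
    final String label;

    Rec(String host, boolean fromHostDb, float score, String label) {
      this.host = host;
      this.fromHostDb = fromHostDb;
      this.score = score;
      this.label = label;
    }
  }

  /** Host ascending, HostDb record first, then score descending. */
  static final Comparator<Rec> SORT = (a, b) -> {
    int c = a.host.compareTo(b.host);
    if (c != 0) {
      return c;
    }
    if (a.fromHostDb != b.fromHostDb) {
      return a.fromHostDb ? -1 : 1; // HostDb record sorts in front
    }
    return Float.compare(b.score, a.score); // higher score first
  };

  /** Returns the labels in the order a reducer would receive them. */
  static List<String> reduceOrder(List<Rec> records) {
    List<Rec> sorted = new ArrayList<>(records);
    sorted.sort(SORT);
    List<String> labels = new ArrayList<>();
    for (Rec r : sorted) {
      labels.add(r.label);
    }
    return labels;
  }
}
```

With this order the reducer can read the host-level information from the 
first record before it sees any URL of that host.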



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
