[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

Sebastian Nagel (JIRA) Thu, 30 Nov 2017 09:19:16 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272978#comment-16272978
 ]


Sebastian Nagel commented on NUTCH-2455:
----------------------------------------

Right now the HostDb is read in every call of the map function, which is 
really inefficient. There may also be several hundred map tasks (this depends 
mostly on the size of the CrawlDb and how it is stored: the number of parts 
and whether the parts are splittable), so reading the HostDb multiple times 
wastes resources as well. Also keep in mind: how would the right HostDb entry 
be passed to the map function? I'm not aware of a way to do it, except the 
usual way as MapReduce input (or by keeping the whole HostDb in memory, but 
that does not scale).

> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??
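
The ordering idea in the citation above can be sketched in plain Java, without 
any Hadoop or Nutch classes. This is only an illustration under assumptions, 
not the attached patch and not the Generator's actual code: the names HostKey, 
partition, compare and reduce are hypothetical, and the "delay=..." payload 
merely stands in for a real HostDb datum. Records are partitioned by host 
only, and the comparator places the HostDb record ahead of the CrawlDb records 
of the same host, so a single pass over the sorted records has the host's 
entry at hand before any of its URLs is seen.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation only: all names here are hypothetical, none is a Nutch or Hadoop API.
class SecondarySortSketch {

    /** Hypothetical composite key: host, record type, and score, plus a payload. */
    static final class HostKey {
        final String host;
        final boolean isHostDbEntry; // true for the HostDb record of this host
        final float score;           // CrawlDb score; unused for HostDb records
        final String payload;        // stand-in for the real value (URL or HostDb datum)

        private HostKey(String host, boolean isHostDbEntry, float score, String payload) {
            this.host = host;
            this.isHostDbEntry = isHostDbEntry;
            this.score = score;
            this.payload = payload;
        }

        static HostKey hostDb(String host, String info) {
            return new HostKey(host, true, 0f, info);
        }

        static HostKey crawlDb(String host, float score, String url) {
            return new HostKey(host, false, score, url);
        }
    }

    /** Partition by host only, so all records of one host reach the same reducer. */
    static int partition(HostKey key, int numReducers) {
        return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort order: by host, then HostDb record first, then score descending. */
    static int compare(HostKey a, HostKey b) {
        int c = a.host.compareTo(b.host);
        if (c != 0) return c;
        c = Boolean.compare(b.isHostDbEntry, a.isHostDbEntry); // true sorts first
        if (c != 0) return c;
        return Float.compare(b.score, a.score);
    }

    /** Single pass over the sorted records; the HostDb entry is seen before its URLs. */
    static List<String> reduce(List<HostKey> records) {
        List<HostKey> sorted = new ArrayList<>(records);
        sorted.sort(SecondarySortSketch::compare);
        List<String> out = new ArrayList<>();
        String currentHost = null;
        String hostDbInfo = null;
        for (HostKey k : sorted) {
            if (!k.host.equals(currentHost)) { // new host: forget the previous entry
                currentHost = k.host;
                hostDbInfo = null;
            }
            if (k.isHostDbEntry) {
                hostDbInfo = k.payload;
            } else {
                String info = (hostDbInfo == null) ? "no HostDb entry" : hostDbInfo;
                out.add(k.payload + " [" + info + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<HostKey> in = new ArrayList<>();
        in.add(HostKey.crawlDb("a.example", 0.2f, "http://a.example/2"));
        in.add(HostKey.hostDb("a.example", "delay=5000"));
        in.add(HostKey.crawlDb("a.example", 0.9f, "http://a.example/1"));
        for (String line : reduce(in)) {
            System.out.println(line);
        }
    }
}
```

Only one HostDb entry per host has to be held at a time, so memory use does 
not grow with the size of the HostDb. In a real job the same roles would be 
taken by a partitioner, a sort comparator and a grouping comparator.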



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
