[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272978#comment-16272978 ]
Sebastian Nagel commented on NUTCH-2455:
----------------------------------------
Right now the HostDb is read in every call of the map function, which is really
inefficient. There may also be several hundred map tasks (depending mostly on
the size of the CrawlDb and how it is stored: the number of parts and whether
the parts are splittable). Reading the HostDb multiple times is also a waste
of resources. Also keep in mind: how do we pass the right HostDb entry to the
map function? I'm not aware of a way to do it, except for the usual way as
MapReduce input (or keeping the whole HostDb in memory, but that does not
scale).
> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with
> same host end up in the same call of the reducer. If values can also hold a
> HostDb entry and the sort comparator guarantees that the HostDb entry
> (entries if partitioned by domain or IP) comes in front of all CrawlDb
> entries. But that would be a substantial improvement...??
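
For illustration only, here is a minimal plain-Java sketch of the ordering
described in the citation above. It does not use the Hadoop API and is not
taken from the attached patch; JoinKey, partition(), compare() and reduceAll()
are hypothetical names, and the payload strings stand in for the real HostDb
and CrawlDb values. It only shows the idea: partition by host, and sort so
that the HostDb entry precedes all CrawlDb entries of that host.

```java
import java.util.ArrayList;
import java.util.List;

public class HostJoinSketch {

  /** Hypothetical composite key: host, plus fields used only for sorting. */
  static final class JoinKey {
    final String host;
    final boolean hostDbEntry; // true for the HostDb record, false for CrawlDb
    final float score;         // ignored for HostDb records
    final String payload;      // stand-in for the actual value

    JoinKey(String host, boolean hostDbEntry, float score, String payload) {
      this.host = host;
      this.hostDbEntry = hostDbEntry;
      this.score = score;
      this.payload = payload;
    }
  }

  /** Partition on the host only, so one reducer sees all records of a host. */
  static int partition(JoinKey key, int numReducers) {
    return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  /** Sort by host, then HostDb entry first, then CrawlDb by descending score. */
  static int compare(JoinKey a, JoinKey b) {
    int c = a.host.compareTo(b.host);
    if (c != 0) return c;
    if (a.hostDbEntry != b.hostDbEntry) return a.hostDbEntry ? -1 : 1;
    return Float.compare(b.score, a.score);
  }

  /** Simulates the reducer: each CrawlDb record is paired with its host info. */
  static List<String> reduceAll(List<JoinKey> records) {
    List<JoinKey> sorted = new ArrayList<>(records);
    sorted.sort(HostJoinSketch::compare);
    List<String> out = new ArrayList<>();
    String currentHost = null;
    String hostInfo = null;
    for (JoinKey k : sorted) {
      if (!k.host.equals(currentHost)) { // a new host group starts here
        currentHost = k.host;
        hostInfo = null;
      }
      if (k.hostDbEntry) {               // arrives before any CrawlDb record
        hostInfo = k.payload;
        continue;
      }
      out.add(k.payload + " [" + (hostInfo == null ? "no HostDb entry" : hostInfo) + "]");
    }
    return out;
  }

  public static void main(String[] args) {
    List<JoinKey> records = new ArrayList<>();
    records.add(new JoinKey("b.example", false, 0.5f, "http://b.example/1"));
    records.add(new JoinKey("a.example", false, 0.2f, "http://a.example/low"));
    records.add(new JoinKey("a.example", true, 0.0f, "delay=5s"));
    records.add(new JoinKey("a.example", false, 0.9f, "http://a.example/high"));
    for (String line : reduceAll(records)) {
      System.out.println(line);
    }
  }
}
```

With this ordering the HostDb is read once as regular job input, and the
reducer only needs to remember the single HostDb entry of the host it is
currently processing instead of holding the whole HostDb in memory.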
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)