[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407778#comment-16407778 ]
Semyon Semyonov commented on NUTCH-2455:
----------------------------------------
I see a conflict between this branch and master. Let me know when you want to
merge it, and I will fix the conflicts.
By the way, we ran it several times with host counts between 100,000 and
2,000,000, and it worked quite well.
> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Priority: Major
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with
> same host end up in the same call of the reducer. If values can also hold a
> HostDb entry and the sort comparator guarantees that the HostDb entry
> (entries if partitioned by domain or IP) comes in front of all CrawlDb
> entries. But that would be a substantial improvement...??
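
The approach quoted above can be illustrated with a minimal sketch in plain
Java. It does not use the Hadoop or Nutch APIs and is not the code from
NUTCH-2455.patch. The class HostSecondarySortSketch, the Key type, its
hostDbEntry flag, and the partition and reduceGroups helpers are all
hypothetical names chosen for illustration. The sketch only shows the ordering
idea: partition by host, sort by host, place the HostDb entry ahead of the
CrawlDb entries, then order the remaining entries by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HostSecondarySortSketch {

    /** Hypothetical composite key: host plus score, flagged if it carries the HostDb entry. */
    static final class Key {
        final String host;
        final boolean hostDbEntry;
        final float score;

        Key(String host, boolean hostDbEntry, float score) {
            this.host = host;
            this.hostDbEntry = hostDbEntry;
            this.score = score;
        }

        @Override
        public String toString() {
            return host + (hostDbEntry ? " [HostDb]" : " score=" + score);
        }
    }

    /** Sort order: host ascending, HostDb entry first, then score descending. */
    static final Comparator<Key> SORT = (a, b) -> {
        int c = a.host.compareTo(b.host);
        if (c != 0) return c;
        c = Boolean.compare(b.hostDbEntry, a.hostDbEntry); // true sorts first
        if (c != 0) return c;
        return Float.compare(b.score, a.score);            // higher score first
    };

    /** Partition on the host only, so all keys of one host reach the same reducer. */
    static int partition(Key key, int numPartitions) {
        return (key.host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Simulates sort plus grouping by host: one list per simulated reduce call. */
    static Map<String, List<Key>> reduceGroups(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Key>> groups = new LinkedHashMap<>();
        for (Key k : sorted) {
            groups.computeIfAbsent(k.host, h -> new ArrayList<>()).add(k);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key("b.example", false, 0.5f),
            new Key("a.example", false, 0.2f),
            new Key("a.example", true, 0.0f),
            new Key("a.example", false, 0.9f),
            new Key("b.example", true, 0.0f));
        // Each group starts with the HostDb entry, followed by CrawlDb entries.
        reduceGroups(keys).forEach((host, group) -> System.out.println(host + " -> " + group));
    }
}
```

In a real MapReduce job the same three roles would presumably be filled by a
partitioner, a sort comparator, and a grouping comparator, so that the reducer
sees the HostDb entry before any CrawlDb entry of that host.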
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)