[
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467272#comment-17467272
]
Markus Jelsma commented on NUTCH-2924:
--------------------------------------
Well, the old approach did work if you had a few hosts. But opening and closing
the SequenceFile and scanning it every time for every host quickly becomes
unusable if you have 100k hosts. This patch, against 1.15, removes the old
logic and just reads the entire hostdb into memory at startup.
This is still fine if you have up to a million hosts; beyond that, it requires
serious memory.
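
To illustrate the trade-off, here is a minimal sketch in plain Java. It is not
the code from the patch: the class HostDbLookupSketch, the HostEntry type, and
the long value per host are hypothetical stand-ins, and an in-memory List plays
the role of the SequenceFile. It only contrasts a full scan per host with a
single load into a map at startup.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Nutch's actual classes: the List<HostEntry>
// plays the role of the hostdb SequenceFile on disk.
final class HostDbLookupSketch {

    // One record of the stand-in hostdb (host name plus an assumed numeric value).
    static final class HostEntry {
        final String host;
        final long value;

        HostEntry(String host, long value) {
            this.host = host;
            this.value = value;
        }
    }

    // Old approach: scan all records again for every host that is looked up.
    // Each lookup costs O(number of records).
    static Long scanForHost(List<HostEntry> hostDb, String host) {
        for (HostEntry entry : hostDb) {
            if (entry.host.equals(host)) {
                return entry.value;
            }
        }
        return null; // host not present
    }

    // New approach: read every record once at startup and keep it in memory.
    // Later lookups are O(1) on average, but memory grows with the host count.
    static Map<String, Long> loadAll(List<HostEntry> hostDb) {
        Map<String, Long> inMemory = new HashMap<>();
        for (HostEntry entry : hostDb) {
            inMemory.put(entry.host, entry.value);
        }
        return inMemory;
    }
}
```

With N hosts in the hostdb and N lookups, the scan variant does on the order of
N * N record reads, while the preloaded map does N reads up front and then one
hash lookup per host. The price is that the whole map has to fit in the heap.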
> Generate maxCount expr evaluated only once
> ------------------------------------------
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.16
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.19
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's
> reducer; instead, it must be set once per host.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)