[
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586690#comment-15586690
]
Sebastian Nagel commented on NUTCH-2328:
----------------------------------------
> the only solution is to have a cluster wide propagated count
No, this is not required. The solution with an instance variable is by design:
- local, per-reducer limit = topN / number of reducers
- every reducer checks only for the local limit
- in sum, there will be topN URLs generated
The condition is that URLs are evenly distributed across different hosts (at
least as many as there are reducers), cf.
[[1|https://www.mail-archive.com/[email protected]/msg14499.html]].
A job-wide counter would not guarantee the limit any better, because there is
no control over when reduce tasks are launched. An even distribution across
reducers/parts would only be achieved if all tasks ran in parallel, at similar
speed, and with no task failures. That will hardly happen in a production
Hadoop cluster. In a realistic scenario some tasks are launched first and get
more URLs, while tasks launched later get fewer or even none. However, to
achieve optimal utilization of the fetcher, all parts should be of equal size.
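The difference between the buggy static counter and the intended per-reducer
instance counter can be sketched as below. This is a simplified toy model, not
the actual Nutch code: the method names and the loop structure are hypothetical,
and only the counter scoping mirrors GeneratorReducer#count and the
local-limit design (limit = topN / number of reducers) described above.

```java
public class GeneratorCounterSketch {

    // Bug shape (hypothetical simplification): a static counter survives
    // across jobs submitted in the same JVM / process space.
    static long staticCount = 0;

    /** Simulates one GeneratorJob run with a static counter; returns URLs generated. */
    static long runWithStaticCount(long topN, long candidateUrls) {
        long generated = 0;
        for (long i = 0; i < candidateUrls; i++) {
            if (staticCount >= topN) break;   // stale state from the previous run
            staticCount++;
            generated++;
        }
        return generated;
    }

    /** Fix shape: a fresh per-reducer instance counter, local limit = topN / numReducers. */
    static long runWithInstanceCount(long topN, int numReducers, long urlsPerReducer) {
        long generated = 0;
        long localLimit = topN / numReducers;  // per-reducer share of topN
        for (int r = 0; r < numReducers; r++) {
            long count = 0;                    // re-initialized for every reducer
            for (long i = 0; i < urlsPerReducer; i++) {
                if (count >= localLimit) break;
                count++;
                generated++;
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        // With the static counter, the second "job" in the same JVM generates nothing.
        System.out.println(runWithStaticCount(10, 100));      // 10
        System.out.println(runWithStaticCount(10, 100));      // 0

        // With per-reducer instance counters, every run generates topN.
        System.out.println(runWithInstanceCount(10, 2, 100)); // 10
        System.out.println(runWithInstanceCount(10, 2, 100)); // 10
    }
}
```

Note that the instance-counter version only sums to exactly topN when URLs are
spread across enough hosts to fill every reducer's local quota, which is the
condition stated above.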
> GeneratorJob does not generate anything on second run
> -----------------------------------------------------
>
> Key: NUTCH-2328
> URL: https://issues.apache.org/jira/browse/NUTCH-2328
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
> Environment: Ubuntu 16.04 / Hadoop 2.7.1
> Reporter: Arthur B
> Labels: fails, generator, subsequent
> Fix For: 2.4
>
> Attachments: generator-issue-static-count.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Given a topN parameter (e.g. 10), the GeneratorJob fails to generate
> anything new on subsequent runs within the same process space.
> To reproduce the issue, submit the GeneratorJob twice in a row to the
> M/R framework. The second run will report 0 generated URLs.
> This issue is caused by the use of the static count field
> (org.apache.nutch.crawl.GeneratorReducer#count) to determine whether the topN
> value has been reached.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)