[
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586690#comment-15586690
]
Sebastian Nagel commented on NUTCH-2328:
----------------------------------------
> the only solution is to have a cluster wide propagated count
No, this is not required. The solution with an instance variable is by design:
- local, per-reducer limit = topN / number of reducers
- every reducer checks only for the local limit
- in sum, there will be topN URLs generated
The condition is that URLs are evenly distributed across different hosts (at
least as many as there are reducers), cf.
[[1|https://www.mail-archive.com/[email protected]/msg14499.html]].
A job-wide counter would not guarantee the limit any better, because there is
no control over when reduce tasks are launched. An even distribution across
reducers/parts would only be achieved if all tasks ran in parallel, at similar
speed, and with no task failures. That will hardly happen in a production
Hadoop cluster. In a realistic scenario some tasks are launched first and get
more URLs, while tasks launched later get fewer or even none. However, to
achieve optimal utilization of the fetcher, all parts should be of equal size.
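The difference between the buggy static counter and the intended per-reducer
instance counter can be sketched as below. This is a simplified toy model, not
the actual Nutch code: the method names and the loop structure are hypothetical,
and only the counter scoping mirrors GeneratorReducer#count and the
local-limit design (limit = topN / number of reducers) described above.

```java
public class GeneratorCounterSketch {

    // Bug shape (hypothetical simplification): a static counter survives
    // across jobs submitted in the same JVM / process space.
    static long staticCount = 0;

    /** Simulates one GeneratorJob run with a static counter; returns URLs generated. */
    static long runWithStaticCount(long topN, long candidateUrls) {
        long generated = 0;
        for (long i = 0; i < candidateUrls; i++) {
            if (staticCount >= topN) break;   // stale state from the previous run
            staticCount++;
            generated++;
        }
        return generated;
    }

    /** Fix shape: a fresh per-reducer instance counter, local limit = topN / numReducers. */
    static long runWithInstanceCount(long topN, int numReducers, long urlsPerReducer) {
        long generated = 0;
        long localLimit = topN / numReducers;  // per-reducer share of topN
        for (int r = 0; r < numReducers; r++) {
            long count = 0;                    // re-initialized for every reducer
            for (long i = 0; i < urlsPerReducer; i++) {
                if (count >= localLimit) break;
                count++;
                generated++;
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        // With the static counter, the second "job" in the same JVM generates nothing.
        System.out.println(runWithStaticCount(10, 100));      // 10
        System.out.println(runWithStaticCount(10, 100));      // 0

        // With per-reducer instance counters, every run generates topN.
        System.out.println(runWithInstanceCount(10, 2, 100)); // 10
        System.out.println(runWithInstanceCount(10, 2, 100)); // 10
    }
}
```

Note that the instance-counter version only sums to exactly topN when URLs are
spread across enough hosts to fill every reducer's local quota, which is the
condition stated above.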
> GeneratorJob does not generate anything on second run
> -----------------------------------------------------
>
> Key: NUTCH-2328
> URL: https://issues.apache.org/jira/browse/NUTCH-2328
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
> Environment: Ubuntu 16.04 / Hadoop 2.7.1
> Reporter: Arthur B
> Labels: fails, generator, subsequent
> Fix For: 2.4
>
> Attachments: generator-issue-static-count.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Given a topN parameter (e.g. 10), the GeneratorJob fails to generate
> anything new on subsequent runs within the same process space.
> To reproduce the issue, submit the GeneratorJob twice in a row to the
> M/R framework. The second run will report 0 generated URLs.
> This issue is caused by the use of the static count field
> (org.apache.nutch.crawl.GeneratorReducer#count) to determine whether the topN
> value has been reached.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)