Sebastian Nagel commented on NUTCH-2328:

I don't know what's specific to Spring for Hadoop, but in (pseudo-)distributed 
mode an instance variable should be task-local and not shared across the 
cluster. If I understand correctly, the problem is, strictly speaking, not that 
the variable is shared but that it survives the life cycle of a MapReduce task. 
Normally there is only one Mapper or Reducer object per JVM. By configuration 
there can be more, running in parallel threads, but every task should have its 
own instance. If {{limit}} is per reduce task (= topN / numReducers), {{count}} 
should be per reduce task as well. Otherwise, with multiple reducers, the 
generator stops too early. And it must be per task because global values are 
not predictable in a distributed environment: if reduce tasks fail, the global 
job counts can move backwards. Btw., I'm not even sure whether counters in the 
task context are updated with global job values.
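
To illustrate the point about {{limit}} and {{count}}, here is a minimal 
plain-Java sketch. It is not the Nutch GeneratorReducer code and does not use 
the Hadoop API: the classes StaticCountLimiter and InstanceCountLimiter and the 
method runJob are hypothetical names, and the sketch assumes that all reduce 
tasks of a job run one after another in the same JVM (as in local mode).

```java
import java.util.function.LongFunction;

public class CountScopeDemo {

    /** Decides whether one more URL may be generated by the current task. */
    interface UrlLimiter {
        boolean accept();
    }

    /** Hypothetical stand-in for a reducer that keeps its count in a static field. */
    static class StaticCountLimiter implements UrlLimiter {
        static long count = 0; // shared by all instances, survives tasks and jobs
        private final long limit;

        StaticCountLimiter(long limit) {
            this.limit = limit;
        }

        @Override
        public boolean accept() {
            if (count >= limit) {
                return false;
            }
            count++;
            return true;
        }
    }

    /** Hypothetical stand-in for a reducer that keeps its count per task instance. */
    static class InstanceCountLimiter implements UrlLimiter {
        private long count = 0; // starts at zero for every new task
        private final long limit;

        InstanceCountLimiter(long limit) {
            this.limit = limit;
        }

        @Override
        public boolean accept() {
            if (count >= limit) {
                return false;
            }
            count++;
            return true;
        }
    }

    /**
     * Simulates one job: numReducers tasks run sequentially in one JVM,
     * each with its own limiter instance and limit = topN / numReducers.
     * Returns the total number of URLs accepted by all tasks.
     */
    static long runJob(long topN, int numReducers, int candidatesPerTask,
                       LongFunction<UrlLimiter> limiterFactory) {
        long limit = topN / numReducers;
        long generated = 0;
        for (int task = 0; task < numReducers; task++) {
            UrlLimiter limiter = limiterFactory.apply(limit);
            for (int i = 0; i < candidatesPerTask; i++) {
                if (limiter.accept()) {
                    generated++;
                }
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        // topN = 10, 2 reduce tasks, 25 candidate URLs per task
        System.out.println("static count, run 1:   "
            + runJob(10, 2, 25, StaticCountLimiter::new));
        System.out.println("static count, run 2:   "
            + runJob(10, 2, 25, StaticCountLimiter::new));
        System.out.println("instance count, run 1: "
            + runJob(10, 2, 25, InstanceCountLimiter::new));
        System.out.println("instance count, run 2: "
            + runJob(10, 2, 25, InstanceCountLimiter::new));
    }
}
```

Under these assumptions the static variant generates 5 URLs in the first run 
(the second task already sees the per-task limit reached) and 0 in the second 
run, while the per-instance variant generates 10 URLs in each run.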

> GeneratorJob does not generate anything on second run
> -----------------------------------------------------
>                 Key: NUTCH-2328
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2328
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
>         Environment: Ubuntu 16.04 / Hadoop 2.7.1
>            Reporter: Arthur B
>              Labels: fails, generator, subsequent
>             Fix For: 2.4
>         Attachments: generator-issue-static-count.patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> Given a topN parameter (e.g. 10), the GeneratorJob will fail to generate 
> anything new on subsequent runs within the same process space.
> To reproduce the issue, submit the GeneratorJob twice, one run after the 
> other, to the M/R framework. The second run will report that it generated 
> 0 URLs.
> This issue is due to the use of the static count field 
> (org.apache.nutch.crawl.GeneratorReducer#count) to determine whether the topN 
> value has been reached.

This message was sent by Atlassian JIRA
