[
https://issues.apache.org/jira/browse/NUTCH-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586031#comment-15586031
]
Sebastian Nagel commented on NUTCH-2328:
----------------------------------------
I don't know what's specific to Spring for Hadoop, but in (pseudo)distributed
mode an instance variable should be task-local and not shared across the
cluster. If I understood correctly, the problem is, strictly speaking, not that
the variable is shared but that it survives the life cycle of a MapReduce task.
Normally there is only one Mapper or Reducer object per JVM. By configuration
there can be more in parallel threads, but every task should have its own
instance. If {{limit}} is per reduce task (= topN / numReducers), {{count}}
should be per task as well. Otherwise, with multiple reducers, the generator
stops too early. And it must be per task because no globals are predictable in
a distributed environment: if reduce tasks fail, the global job counts can move
backwards.
Btw., I'm not even sure whether counters in the task context are updated with
global job values.
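
To illustrate the life-cycle point, here is a minimal sketch in plain Java with no Hadoop dependency. It is not the actual GeneratorReducer code: the class and method names (ReducerCountSketch, StaticCountReducer, TaskLocalCountReducer, select, runStaticTask, runTaskLocalTask, sampleUrls) are made up for this example. It only assumes what the issue describes, namely that a static {{count}} is compared against {{limit}}, and it simulates two consecutive jobs running in the same JVM.

```java
import java.util.ArrayList;
import java.util.List;

public class ReducerCountSketch {

    /** Hypothetical reducer whose counter is static, i.e. shared by all instances in the JVM. */
    static class StaticCountReducer {
        static long count = 0;          // survives the task that incremented it
        private final long limit;

        StaticCountReducer(long limit) {
            this.limit = limit;
        }

        /** Returns true if the URL is selected, false once the limit is reached. */
        boolean select(String url) {
            if (count >= limit) {
                return false;
            }
            count++;
            return true;
        }
    }

    /** Same logic, but the counter belongs to the instance (one instance per task). */
    static class TaskLocalCountReducer {
        private long count = 0;         // starts at zero for every new task
        private final long limit;

        TaskLocalCountReducer(long limit) {
            this.limit = limit;
        }

        boolean select(String url) {
            if (count >= limit) {
                return false;
            }
            count++;
            return true;
        }
    }

    /** Simulates one reduce task with a fresh reducer object; returns the number of selected URLs. */
    static int runStaticTask(long limit, List<String> urls) {
        StaticCountReducer reducer = new StaticCountReducer(limit);
        int selected = 0;
        for (String url : urls) {
            if (reducer.select(url)) {
                selected++;
            }
        }
        return selected;
    }

    static int runTaskLocalTask(long limit, List<String> urls) {
        TaskLocalCountReducer reducer = new TaskLocalCountReducer(limit);
        int selected = 0;
        for (String url : urls) {
            if (reducer.select(url)) {
                selected++;
            }
        }
        return selected;
    }

    /** Builds n placeholder URLs. */
    static List<String> sampleUrls(int n) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            urls.add("http://example.org/page" + i);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = sampleUrls(20);
        long topN = 10;

        // Two consecutive jobs in the same JVM, one reduce task each.
        System.out.println("static,     job 1: " + runStaticTask(topN, urls));    // 10
        System.out.println("static,     job 2: " + runStaticTask(topN, urls));    // 0
        System.out.println("task-local, job 1: " + runTaskLocalTask(topN, urls)); // 10
        System.out.println("task-local, job 2: " + runTaskLocalTask(topN, urls)); // 10
    }
}
```

In the static variant the second job starts with {{count}} already at the limit, so nothing is selected. With an instance field, every new reducer object starts counting from zero.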
> GeneratorJob does not generate anything on second run
> -----------------------------------------------------
>
> Key: NUTCH-2328
> URL: https://issues.apache.org/jira/browse/NUTCH-2328
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
> Environment: Ubuntu 16.04 / Hadoop 2.7.1
> Reporter: Arthur B
> Labels: fails, generator, subsequent
> Fix For: 2.4
>
> Attachments: generator-issue-static-count.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Given a topN parameter (e.g. 10), the GeneratorJob will fail to generate
> anything new on subsequent runs within the same process space.
> To reproduce the issue, submit the GeneratorJob twice, one run after the
> other, to the M/R framework. The second run reports that it generated 0 URLs.
> This issue is due to the usage of the static count field
> (org.apache.nutch.crawl.GeneratorReducer#count) to determine if the topN
> value has been reached.