[ https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402022#comment-16402022 ]

Ben Vachon commented on NUTCH-2536:
-----------------------------------

pull request: https://github.com/apache/nutch/pull/298

> GeneratorReducer.count is a static variable
> -------------------------------------------
>
>                 Key: NUTCH-2536
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2536
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>            Reporter: Ben Vachon
>            Priority: Minor
>              Labels: Generate
>             Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means
> that if the GeneratorJob is run multiple times within the same JVM, it will
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value,
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the
> script starts a new JVM each time. I'm not using the script; I'm calling the
> Java classes directly (using the ToolRunner) within an existing JVM, so I'm
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class
> as it's run by the script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
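
For readers who want to see the effect described in the issue, here is a minimal, self-contained Java sketch. It is not the Nutch source: the classes StaticCountReducer and InstanceCountReducer, the offer method, and the runBatches helper are all hypothetical names made up for illustration, and the only thing taken from the issue is the idea of a counter checked against topN. It shows how a static counter carries over between batches run in the same JVM, while an instance counter starts fresh for each batch.

```java
import java.util.ArrayList;
import java.util.List;

public class CounterScopeSketch {

    /** Hypothetical reducer-like class: the counter is static, so it is shared JVM-wide. */
    static class StaticCountReducer {
        static long count = 0;
        private final long topN;

        StaticCountReducer(long topN) { this.topN = topN; }

        /** Returns true if the URL is accepted, i.e. the cap has not been reached yet. */
        boolean offer(String url) {
            if (count >= topN) return false;
            count++;
            return true;
        }
    }

    /** Same logic, but the counter belongs to the instance, so each batch starts at zero. */
    static class InstanceCountReducer {
        private long count = 0;
        private final long topN;

        InstanceCountReducer(long topN) { this.topN = topN; }

        boolean offer(String url) {
            if (count >= topN) return false;
            count++;
            return true;
        }
    }

    /**
     * Runs several batches inside this one JVM, creating a new reducer per batch,
     * and returns how many URLs each batch accepted.
     */
    public static List<Integer> runBatches(boolean useStaticCounter, int batches,
                                           int urlsPerBatch, long topN) {
        // Demo-only reset so that every call starts as a freshly launched JVM would.
        StaticCountReducer.count = 0;
        List<Integer> accepted = new ArrayList<>();
        for (int b = 0; b < batches; b++) {
            StaticCountReducer staticReducer = new StaticCountReducer(topN);
            InstanceCountReducer instanceReducer = new InstanceCountReducer(topN);
            int acceptedInBatch = 0;
            for (int u = 0; u < urlsPerBatch; u++) {
                String url = "http://example.com/" + b + "/" + u; // placeholder URL
                boolean ok = useStaticCounter
                        ? staticReducer.offer(url)
                        : instanceReducer.offer(url);
                if (ok) acceptedInBatch++;
            }
            accepted.add(acceptedInBatch);
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Two batches, five candidate URLs each, topN = 3.
        System.out.println("static:   " + runBatches(true, 2, 5, 3));  // [3, 0]
        System.out.println("instance: " + runBatches(false, 2, 5, 3)); // [3, 3]
    }
}
```

With the static counter the second batch accepts nothing because the first batch already used up the cap; with the instance counter each batch is capped at topN independently, which matches the reading of topN given in the issue.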