[ https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Vachon updated NUTCH-2536:
------------------------------
    Environment: Non-distributed, single node, standalone Nutch jobs run in a single JVM with HBase as the data store. 2.3.1

> GeneratorReducer.count is a static variable
> -------------------------------------------
>
>                 Key: NUTCH-2536
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2536
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>         Environment: Non-distributed, single node, standalone Nutch jobs run
> in a single JVM with HBase as the data store. 2.3.1
>            Reporter: Ben Vachon
>            Priority: Minor
>              Labels: Generate
>             Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means
> that if the GeneratorJob is run multiple times within the same JVM, the field
> keeps counting the webpages generated across all batches instead of starting
> from zero for each batch.
> The count field is checked against the GeneratorJob's topN configuration
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value,
> not ALL batches combined.
> This isn't a problem with the way that Nutch is typically used, because the
> script starts a new JVM each time. I'm not using the script; I'm calling the
> Java classes directly (using the ToolRunner) within an existing JVM, so I'm
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class
> as it's run by the script.
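
The effect described above can be reproduced with a small standalone sketch. This is not the Nutch source: the class names StaticCountReducer and InstanceCountReducer, the reduce method, and the helper methods are hypothetical stand-ins that only mimic a counter checked against topN. It assumes two "jobs" run one after the other in the same JVM with topN = 3 and five candidate URLs each.

```java
import java.util.ArrayList;
import java.util.List;

public class TopNCounterDemo {

    /** Hypothetical stand-in for a reducer whose counter is static (shared JVM-wide). */
    static class StaticCountReducer {
        private static long count = 0; // survives across jobs in the same JVM
        private final long topN;

        StaticCountReducer(long topN) { this.topN = topN; }

        /** Returns the URLs admitted to this batch. */
        List<String> reduce(List<String> candidates) {
            List<String> batch = new ArrayList<>();
            for (String url : candidates) {
                if (count >= topN) break; // cap is checked against the shared total
                batch.add(url);
                count++;
            }
            return batch;
        }
    }

    /** Same logic, but the counter belongs to the instance, so each job starts at zero. */
    static class InstanceCountReducer {
        private long count = 0;
        private final long topN;

        InstanceCountReducer(long topN) { this.topN = topN; }

        List<String> reduce(List<String> candidates) {
            List<String> batch = new ArrayList<>();
            for (String url : candidates) {
                if (count >= topN) break; // cap is checked against this batch only
                batch.add(url);
                count++;
            }
            return batch;
        }
    }

    /** Builds n placeholder URLs (example.org is used purely for illustration). */
    static List<String> urls(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add("http://example.org/page" + i);
        return out;
    }

    /** Runs `jobs` consecutive jobs with the static counter; returns each batch size. */
    static int[] staticBatchSizes(long topN, int candidates, int jobs) {
        StaticCountReducer.count = 0; // simulate a freshly started JVM
        int[] sizes = new int[jobs];
        for (int j = 0; j < jobs; j++) {
            sizes[j] = new StaticCountReducer(topN).reduce(urls(candidates)).size();
        }
        return sizes;
    }

    /** Runs `jobs` consecutive jobs with the instance counter; returns each batch size. */
    static int[] instanceBatchSizes(long topN, int candidates, int jobs) {
        int[] sizes = new int[jobs];
        for (int j = 0; j < jobs; j++) {
            sizes[j] = new InstanceCountReducer(topN).reduce(urls(candidates)).size();
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] s = staticBatchSizes(3, 5, 2);
        int[] i = instanceBatchSizes(3, 5, 2);
        System.out.println("static:   " + s[0] + ", " + s[1]); // static:   3, 0
        System.out.println("instance: " + i[0] + ", " + i[1]); // instance: 3, 3
    }
}
```

In this sketch the second job with the static counter generates nothing, because the cap was already reached by the first job, while the instance counter caps each batch separately. When every job runs in its own JVM, as with the script, both variants behave the same.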