Ben Vachon created NUTCH-2536:
---------------------------------

             Summary: GeneratorReducer.count is a static variable
                 Key: NUTCH-2536
                 URL: https://issues.apache.org/jira/browse/NUTCH-2536
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 2.3.1
            Reporter: Ben Vachon
             Fix For: 2.4


The count field of the GeneratorReducer class is a static field. This means 
that if the GeneratorJob is run multiple times within the same JVM, it will 
count all the webpages generated across all batches.

The count field is checked against the GeneratorJob's topN configuration 
variable, which is described as:

"top threshold for maximum number of URLs permitted in a batch"

I understand this to mean that EACH batch should be capped at the topN value, 
not ALL batches.

This isn't a problem with the way that Nutch is typically used because the 
script starts a new JVM each time. I'm not using the script; I'm calling the 
Java classes directly (using the ToolRunner) within an existing JVM, so I'm 
categorizing this as an SDK issue.

Changing the field to be non-static will not affect the behavior of the class 
as it's run by the script.
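
To illustrate the difference, here is a minimal, self-contained sketch in plain 
Java. It is a simplified model, not the Nutch source: the class names 
StaticCountReducer, InstanceCountReducer and TopNDemo, and the offer() method, 
are hypothetical stand-ins. It assumes the reducer stops accepting URLs once 
its count reaches topN.

```java
// Hypothetical model of a reducer whose counter is static:
// the count is shared by every instance created in the same JVM.
class StaticCountReducer {
    private static long count = 0;
    private final long topN;

    StaticCountReducer(long topN) {
        this.topN = topN;
    }

    // Returns true if the URL is accepted into the current batch.
    boolean offer(String url) {
        if (count >= topN) {
            return false;
        }
        count++;
        return true;
    }
}

// Same logic, but the counter is an instance field,
// so each new reducer (each batch) starts from zero.
class InstanceCountReducer {
    private long count = 0;
    private final long topN;

    InstanceCountReducer(long topN) {
        this.topN = topN;
    }

    boolean offer(String url) {
        if (count >= topN) {
            return false;
        }
        count++;
        return true;
    }
}

class TopNDemo {
    // Simulates one generate run with a fresh reducer; returns URLs accepted.
    static int runStaticBatch(long topN, int candidates) {
        StaticCountReducer reducer = new StaticCountReducer(topN);
        int accepted = 0;
        for (int i = 0; i < candidates; i++) {
            if (reducer.offer("http://example.com/" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    static int runInstanceBatch(long topN, int candidates) {
        InstanceCountReducer reducer = new InstanceCountReducer(topN);
        int accepted = 0;
        for (int i = 0; i < candidates; i++) {
            if (reducer.offer("http://example.com/" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Two batches in the same JVM, topN = 3, five candidate URLs each.
        System.out.println("static,   batch 1: " + runStaticBatch(3, 5));
        System.out.println("static,   batch 2: " + runStaticBatch(3, 5));
        System.out.println("instance, batch 1: " + runInstanceBatch(3, 5));
        System.out.println("instance, batch 2: " + runInstanceBatch(3, 5));
    }
}
```

In this model the second static batch accepts nothing, because the shared 
count already equals topN after the first batch, while each instance-counted 
batch is capped at topN independently. When every run gets a fresh JVM, as 
with the script, both variants behave the same.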



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
