Hey Lewis,

On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
> Following on from a discussion on user@ I dived into the GeneratorJob
> code and have the following general comment based on my observation...
> Usage of configuration options is really unstructured and loosely applied.
> This should not be the case. For example
>
> Observations
> ===========
>
> nutch-default.xml
> ---------------------
> - generate.max.count property appears here but I cannot see for the life
> of me where it actually is used in the GeneratorJob, Mapper or Reducer.
>

Not sure if you are asking about where the value of the param is used in the
code logic, or about the practical application of the param for some use case.
The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT", and
GeneratorReducer later picks it up in its local variable "maxcount", which is
used in the reduce method. So I think it is being used in the generate phase.
To be honest, I have never faced a situation where I had to use it, but I think
it might be helpful for some class of (rare) scenarios.

> Unused in GeneratorJob
> --------------------------------
> - GENERATOR_MIN_SCORE - seems not to be used
> - GENERATOR_MAX_COUNT - seems not to be used
>

You are right. These are used in 1.x but not in 2.x. I am not sure whether this
was left out of 2.x intentionally or was simply overlooked. Do you have any
idea?

> Missing in nutch-default.xml
> ------------------------------------
> - generate.min.score - but used in GeneratorJob
>

Well, as per the earlier point, GeneratorJob just picks up this property and
stores it in a local variable. It is not used later by either map or reduce for
any processing.

> - generate.filter - set to true by default and available as a CLI override
> but should also be specified in nutch-default.xml
> - generate.normalise - set to true by default and available as a CLI
> override but should also be specified in nutch-default.xml
> - generate.topN - set to 2^63-1 by default and available as a CLI
> override but should also be specified in nutch-default.xml
>
> Suggestions to add
> --------------------------
> - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this static
> element. I am not sure if it is used... I don't think it is.
>

It is not used. In my opinion, I would favor removing such things. There was
some discussion on the user list about removing such deprecated properties from
nutch-default.xml to avoid confusion (see [1]). The corresponding jira [2] was
limited to the configs discussed in [1]. Maybe this discussion can be regarded
as an extension/continuation of that jira. What say? Any comments on this
please?

> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>
> --
> *Lewis*
>

[1] http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
[2] https://issues.apache.org/jira/browse/NUTCH-1409

Thanks,
Tejas Patil
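
For illustration only, here is a minimal sketch of the generate.max.count flow
described above: a value read from configuration under the "generate.max.count"
key, kept in a local "maxCount", and applied as a per-host cap while selecting
URLs. This is not the actual GeneratorReducer code. The class MaxCountSketch,
the select method, the plain Map standing in for the Hadoop configuration, the
per-host counting, and the "negative means no limit" convention are all
assumptions made for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reduce-side check; NOT Nutch's GeneratorReducer.
class MaxCountSketch {
    // Property key as named in the thread.
    static final String GENERATOR_MAX_COUNT = "generate.max.count";

    // Returns the URLs that survive the per-host cap, in input order.
    // Assumption: a negative (or missing) value means "no limit".
    static List<String> select(Map<String, String> conf, List<String> urls) {
        long maxCount = Long.parseLong(conf.getOrDefault(GENERATOR_MAX_COUNT, "-1"));
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            // Assumption: count by host; other counting modes are not modelled here.
            String host = URI.create(url).getHost();
            int seen = seenPerHost.merge(host, 1, Integer::sum);
            if (maxCount >= 0 && seen > maxCount) {
                continue; // this host already reached its cap
            }
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GENERATOR_MAX_COUNT, "2");
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://a.example/3", "http://b.example/1");
        // With a cap of 2, the third a.example URL is skipped.
        System.out.println(select(conf, urls));
    }
}
```

With the key absent from the map, the sketch falls back to -1 and keeps every
URL, which mirrors the idea of declaring the default in one place instead of
hard-coding it in the job.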

