Hey Lewis,

On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
> Following on from a discussion on user@ I dived into the GeneratorJob
> code and have the following general comment based on my observation...
> Usage of configuration options is really unstructured and loosely applied.
> This should not be the case. For example
>
> Observations
> ===========
>
> nutch-default.xml
> ---------------------
>  - generate.max.count property appears here but I cannot see for the life
> of me where it actually is used in the GeneratorJob, Mapper or Reducer.
>

I am not sure whether you mean where the value of the param is used in the
code logic, or the practical application of the param for some use case.
GeneratorJob holds the key "generate.max.count" in the constant
"GENERATOR_MAX_COUNT", and GeneratorReducer later reads it into its local
variable "maxcount", which is used in the reduce method. So I think it is
being used in the generate phase. To be honest, I have never faced a
situation where I had to use it, but I think it might be helpful for some
class of (rare) scenarios.
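
To illustrate the flow I mean, here is a rough sketch. It is not the actual
Nutch code: the class MaxCountSketch and the method select are made up for
illustration, a plain Map stands in for the Hadoop Configuration, and I am
assuming that counting is per host and that a negative value means no limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the real GeneratorJob/GeneratorReducer.
class MaxCountSketch {
    // Property key as it appears in nutch-default.xml, held in a constant
    // in the same spirit as GENERATOR_MAX_COUNT.
    static final String GENERATOR_MAX_COUNT = "generate.max.count";

    // Hypothetical stand-in for the reduce step: read the limit once, then
    // keep at most maxCount URLs per host (negative = unlimited, assumed).
    static List<String> select(Map<String, String> conf, List<String> urls) {
        long maxCount = Long.parseLong(conf.getOrDefault(GENERATOR_MAX_COUNT, "-1"));
        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            String host = java.net.URI.create(url).getHost();
            int seen = perHost.merge(host, 1, Integer::sum); // count for this host so far
            if (maxCount < 0 || seen <= maxCount) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GENERATOR_MAX_COUNT, "2");
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2",
            "http://a.example/3", "http://b.example/1");
        System.out.println(select(conf, urls));
    }
}
```

With generate.max.count set to 2, the third URL from the same host is dropped
while URLs from other hosts still go through.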

>
> Unused in GeneratorJob
> --------------------------------
>  - GENERATOR_MIN_SCORE - seems not to be used
>  - GENERATOR_MAX_COUNT - seems not to be used
>

You are right. These are used in 1.x but not in 2.x. I am not sure whether
they were intentionally left out of 2.x or were missed through an
oversight. Do you have any idea?

>
> Missing in nutch-default.xml
> ------------------------------------
>  - generate.min.score - but used in GeneratorJob
>
Well, as per the earlier point, GeneratorJob just picks up this property and
stores it in a local variable. It is not used later by either map or reduce
for any processing.

>  - generate.filter - set to true by default and available as a CLI override
> but should also be specified in nutch-default.xml
>  - generate.normalise - set to true by default and available as a CLI
> override but should also be specified in nutch-default.xml
>  - generate.topN - set to 2^63-1 by default and available as a CLI
> override but should also be specified in nutch-default.xml
>
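
As a side note on the three properties above, this is roughly why a missing
nutch-default.xml entry hides the default from users. The sketch is
hypothetical: GenerateDefaultsSketch and resolve are not Nutch names, Maps
stand in for the CLI arguments and the configuration, and the lookup order
(CLI, then config, then hard-coded default) is my assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the real GeneratorJob argument handling.
class GenerateDefaultsSketch {
    // Assumed lookup order: CLI override, then config file, then a default
    // that lives only in the code and is invisible in nutch-default.xml.
    static String resolve(String key, Map<String, String> cli,
                          Map<String, String> conf, String hardCoded) {
        if (cli.containsKey(key)) {
            return cli.get(key);
        }
        return conf.getOrDefault(key, hardCoded);
    }

    public static void main(String[] args) {
        Map<String, String> cli = new HashMap<>();
        Map<String, String> conf = new HashMap<>(); // no generate.* entries, as today
        System.out.println(resolve("generate.filter", cli, conf, "true"));
        System.out.println(resolve("generate.normalise", cli, conf, "true"));
        // 2^63-1 is Long.MAX_VALUE
        System.out.println(resolve("generate.topN", cli, conf, String.valueOf(Long.MAX_VALUE)));
    }
}
```
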
> Suggestions to add
> --------------------------
>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this static
> element. I am not sure if it is used... I don't think it is.
>
It is not used. In my opinion, it would be better to remove such things. There
was some discussion on the user list about removing such deprecated
properties from nutch-default.xml to avoid confusion (see [1]). The
corresponding jira [2] was limited to the configs discussed in [1]. Maybe
this discussion can be regarded as an extension/continuation of that jira.
What do you say?

> Any comments on this please?
>
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html



>
>
> --
> *Lewis*
>

[1] :
http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
[2] : https://issues.apache.org/jira/browse/NUTCH-1409

Thanks,
Tejas Patil
