Hi Lewis

Sorry, I was wrong: the GeneratorJob is only used in Nutch 2.x, not 1.x.

Regarding GENERATOR_COUNT_VALUE_IP, I think we can patch GeneratorJob to
handle it instead of deprecating it. The patch might look like this:

    if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
      getConf().set(URLPartitioner.PARTITION_MODE_KEY,
          URLPartitioner.PARTITION_MODE_HOST);
    } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
      getConf().set(URLPartitioner.PARTITION_MODE_KEY,
          URLPartitioner.PARTITION_MODE_DOMAIN);
    } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
      getConf().set(URLPartitioner.PARTITION_MODE_KEY,
          URLPartitioner.PARTITION_MODE_IP);
    } else {
      LOG.warn("Unknown generator.max.count mode '" + mode
          + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
      getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
      getConf().set(URLPartitioner.PARTITION_MODE_KEY,
          URLPartitioner.PARTITION_MODE_HOST);
    }

If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will never
be set, even if we set the partition.url.mode property to byIP in
nutch-default.xml. Alternatively, maybe the partition.url.mode property
should be removed from nutch-default.xml, because its value depends on
GENERATOR_COUNT_MODE.
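
To illustrate the point, here is a minimal standalone sketch (not the actual
Nutch code). It swaps the Hadoop Configuration for a plain Map, and the class
name, the handleIp flag and the literal values ("generate.count.mode", "host",
"domain", "ip", "byHost", "byDomain", "byIP") are assumptions made for the
example. It only shows that, without the extra branch, a configured
partition.url.mode of byIP gets overwritten.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the mode-mapping branch discussed above.
class PartitionModeSketch {
    // Assumed property names, modelled on the constants named in this thread.
    static final String GENERATOR_COUNT_MODE = "generate.count.mode";
    static final String PARTITION_MODE_KEY = "partition.url.mode";

    // handleIp = false mimics the current branch, true mimics the proposed patch.
    static void applyCountMode(Map<String, String> conf, boolean handleIp) {
        String mode = conf.getOrDefault(GENERATOR_COUNT_MODE, "host");
        if ("host".equalsIgnoreCase(mode)) {
            conf.put(PARTITION_MODE_KEY, "byHost");
        } else if ("domain".equalsIgnoreCase(mode)) {
            conf.put(PARTITION_MODE_KEY, "byDomain");
        } else if (handleIp && "ip".equalsIgnoreCase(mode)) {
            conf.put(PARTITION_MODE_KEY, "byIP");
        } else {
            // Unknown mode: fall back to host for both settings.
            conf.put(GENERATOR_COUNT_MODE, "host");
            conf.put(PARTITION_MODE_KEY, "byHost");
        }
    }

    // Builds a config as if the user had asked for IP mode in both properties.
    static Map<String, String> ipConfig() {
        Map<String, String> conf = new HashMap<>();
        conf.put(GENERATOR_COUNT_MODE, "ip");
        conf.put(PARTITION_MODE_KEY, "byIP");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> without = ipConfig();
        applyCountMode(without, false);
        System.out.println("without IP branch: " + without.get(PARTITION_MODE_KEY));

        Map<String, String> with = ipConfig();
        applyCountMode(with, true);
        System.out.println("with IP branch: " + with.get(PARTITION_MODE_KEY));
    }
}
```

In this sketch the first call ends up with byHost even though byIP was
configured, and the second call keeps byIP.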

What do you think?

Thanks,
lufeng



On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil <[email protected]> wrote:

> Hey Lewis,
>
> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi,
>> Following on from a discussion on user@ I dived into the GeneratorJob
>> code and have the following general comment based on my observation...
>> Usage of configuration options is really unstructured and loosely applied.
>> This should not be the case. For example
>>
>> Observations
>> ===========
>>
>> nutch-default.xml
>> ---------------------
>>  - generate.max.count property appears here but I cannot see for the life
>> of me where it actually is used in the GeneratorJob, Mapper or Reducer.
>>
>
> Not sure if you are talking in terms of usage of the value of the param in
> the code logic or practical application of the param for some use case.
> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT" and
> later this is picked up by GeneratorReducer in its local variable
> "maxcount", which is used in the reduce method. So I think it is being used
> in the generate phase. To be honest, I have never faced a situation where I had to
> use it but I think that it might be helpful for some class of (rare)
> scenarios.
>
>>
>> Unused in GeneratorJob
>> --------------------------------
>>  - GENERATOR_MIN_SCORE - seems not to be used
>>  - GENERATOR_MAX_COUNT - seems not to be used
>>
>
> You are right. These are used in 1.x but not in 2.x. Not sure if this was
> intentionally left out of 2.x or was missed through an oversight. Do you
> have any idea?
>
>>
>> Missing in nutch-default.xml
>> ------------------------------------
>>  - generate.min.score - but used in GeneratorJob
>>
> Well, as per the earlier point, GeneratorJob just picks up this property and
> stores it in a local variable. It is not used later by either map or reduce
> for any processing.
>
>>  - generate.filter - set to true by default and available as a CLI
>> override but should also be specified in nutch-default.xml
>>  - generate.normalise - set to true by default and available as a CLI
>> override but should also be specified in nutch-default.xml
>>  - generate.topN - set to 2^63-1 by default and available as a CLI
>> override but should also be specified in nutch-default.xml
>>
>> Suggestions to add
>> --------------------------
>>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this static
>> element. I am not sure if it is used... I don't think it is.
>>
> It is not used. In my opinion, I would favor removal of such things.
> There was some discussion going on over the user group to remove such
> deprecated properties from nutch-default.xml to avoid confusion. (see [1]).
> The corresponding jira [2] was limited to the configs discussed over [1].
> Maybe this discussion can be regarded as an extension/continuation for that
> jira. What do you say?
>
>> Any comments on this, please?
>>
>> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>
>
>
>>
>>
>> --
>> *Lewis*
>>
>
> [1] :
> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>
> Thanks,
> Tejas Patil
>



-- 
Don't Grow Old, Grow Up... :-)
