Hi Lufeng,
On Wed, Feb 20, 2013 at 7:16 PM, feng lu <[email protected]> wrote:
> Hi Lewis
>
> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>
> As for the GENERATOR_COUNT_VALUE_IP property, I think we can add a patch to
> GeneratorJob instead of deprecating it. The patch might look like this:
>
> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_HOST);
> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_DOMAIN);
> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_IP);
> } else {
>   LOG.warn("Unknown generator.max.count mode '" + mode
>       + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_HOST);
> }
>
The description of the property "generate.count.mode" says that IP-based
counting has been disabled in the newer Generator version. There might have
been some reason for removing it, which we should understand before adding
it back. I am searching for any relevant discussion(s) on user@ / dev@ or
Jira about this. If you find anything, do share.
> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will never
> be set, even if we set the partition.url.mode property to byIP in
> nutch-default.xml. Maybe the partition.url.mode property should then be
> removed from nutch-default.xml, because it depends on the value of
> GENERATOR_COUNT_MODE.
>
> What do you think?
>
URL partitioning is done not only in the generate phase but in the fetch
phase too. The mode of the URLPartitioner is defined by the param
"partition.url.mode", which can be by host, domain or IP. This works out
well for the fetch phase, as it supports partitioning of URLs in all of
these modes. For the generate phase, the mode of the URLPartitioner is
governed by the value of "generate.count.mode" (irrespective of the value
of "partition.url.mode").
This "hack" is implemented in GeneratorJob [0] at lines 176-183.
[0] :
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
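
To make the override concrete, here is a rough, self-contained sketch of
what I mean. It is not the actual GeneratorJob code: the class name, the
plain Map standing in for the Hadoop Configuration, and the string values
"host", "domain", "byHost" and "byDomain" are my assumptions for
illustration only. The two property names are the ones discussed above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the mode handling in GeneratorJob. */
final class PartitionModeSketch {
  // Property names as quoted in this thread.
  static final String GENERATOR_COUNT_MODE = "generate.count.mode";
  static final String PARTITION_MODE_KEY = "partition.url.mode";

  // Assumed values; check GeneratorJob and URLPartitioner for the real ones.
  static final String COUNT_HOST = "host";
  static final String COUNT_DOMAIN = "domain";
  static final String PARTITION_BY_HOST = "byHost";
  static final String PARTITION_BY_DOMAIN = "byDomain";

  /** Overwrites partition.url.mode based on generate.count.mode. */
  static void applyGenerateMode(Map<String, String> conf) {
    String mode = conf.getOrDefault(GENERATOR_COUNT_MODE, COUNT_HOST);
    if (COUNT_HOST.equalsIgnoreCase(mode)) {
      conf.put(PARTITION_MODE_KEY, PARTITION_BY_HOST);
    } else if (COUNT_DOMAIN.equalsIgnoreCase(mode)) {
      conf.put(PARTITION_MODE_KEY, PARTITION_BY_DOMAIN);
    } else {
      // Anything else (including "ip" in this sketch) falls back to host.
      conf.put(GENERATOR_COUNT_MODE, COUNT_HOST);
      conf.put(PARTITION_MODE_KEY, PARTITION_BY_HOST);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(PARTITION_MODE_KEY, "byIP");   // what the user asked for
    conf.put(GENERATOR_COUNT_MODE, "ip");   // not handled in this sketch
    applyGenerateMode(conf);
    System.out.println(conf.get(PARTITION_MODE_KEY)); // prints byHost
  }
}
```

In this sketch, whatever the user puts in "partition.url.mode" is silently
replaced during generate, which is the behaviour I am describing above.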
>
> Thanks,
> lufeng
>
>
>
> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil <[email protected]> wrote:
>
>> Hey Lewis,
>>
>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
>> [email protected]> wrote:
>>
>>> Hi,
>>> Following on from a discussion on user@ I dived into the GeneratorJob
>>> code and have the following general comment based on my observation...
>>> Usage of configuration options is really unstructured and loosely applied.
>>> This should not be the case. For example:
>>>
>>> Observations
>>> ===========
>>>
>>> nutch-default.xml
>>> ---------------------
>>> - generate.max.count property appears here but I cannot see for the
>>> life of me where it actually is used in the GeneratorJob, Mapper or Reducer.
>>>
>>
>> Not sure if you are talking about usage of the param's value in the code
>> logic or about practical application of the param for some use case.
>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT", and
>> later this is picked up by GeneratorReducer in its local variable
>> "maxcount", which is used in the reduce method. So I think it is being
>> used in the generate phase. To be honest, I have never faced a situation
>> where I had to use it, but I think it might be helpful for some class of
>> (rare) scenarios.
>>
>>>
>>> Unused in GeneratorJob
>>> --------------------------------
>>> - GENERATOR_MIN_SCORE - seems not to be used
>>> - GENERATOR_MAX_COUNT - seems not to be used
>>>
>>
>> You are right. These are used in 1.x but not in 2.x. Not sure if this was
>> intentionally left out of 2.x or was simply overlooked during the 2.x
>> work. Do you have any idea?
>>
>>>
>>> Missing in nutch-default.xml
>>> ------------------------------------
>>> - generate.min.score - but used in GeneratorJob
>>>
>> Well, as per the earlier point, GeneratorJob just picks up this property
>> and stores it in a local variable. Later it is not used by either map or
>> reduce for any processing.
>>
>>> - generate.filter - set to true by default and available as a CLI
>>> override but should also be specified in nutch-default.xml
>>> - generate.normalise - set to true by default and available as a CLI
>>> override but should also be specified in nutch-default.xml
>>> - generate.topN - set to 2^63-1 by default and available as a CLI
>>> override but should also be specified in nutch-default.xml
>>>
>>> Suggestions to add
>>> --------------------------
>>> - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this static
>>> element. I am not sure if it is used... I don't think it is.
>>>
>> It is not used. In my opinion, I would favor removal of such things.
>> There was some discussion on the user list about removing such
>> deprecated properties from nutch-default.xml to avoid confusion (see [1]).
>> The corresponding Jira [2] was limited to the configs discussed in [1].
>> Maybe this discussion can be regarded as an extension/continuation of
>> that Jira. What do you say?
>>
>>> Any comments on this please?
>>>
>>> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>
>>
>>
>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>> [1] :
>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>
>> Thanks,
>> Tejas Patil
>>
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
Thanks,
Tejas Patil