Hi Lufeng,

On Wed, Feb 20, 2013 at 9:19 PM, feng lu <[email protected]> wrote:

> Hi Tejas
>
> Yes, you are right. I misread the description of the property
> "generate.count.mode". Sorry about that. I also did not find any information
> about why the IP-based counting mode of "generate.count.mode" was disabled.
>
> Yes, I see that the FetchEntryPartitioner class (built on
> URLPartitioner) is used by FetcherJob. So, as you say, the setting of
> "partition.url.mode" has no effect on the GeneratorJob.
>
> Do you think we could add a more detailed description to the property
> "generate.count.mode", such as:
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generator.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator. In GeneratorJob this mode
>   applies irrespective of the value of 'partition.url.mode'.
>   </description>
> </property>
>
+1. This will help the users.
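
To make the behaviour that description documents concrete, here is a rough sketch of how I understand it. It is not the actual GeneratorJob code: the class CountModeSketch, the method applyCountMode and the plain Map standing in for the Hadoop Configuration are all made up for illustration, and the literal values "byHost" / "byDomain" are my assumption about what the URLPartitioner constants hold.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the mode handling discussed in this thread; not Nutch code. */
class CountModeSketch {
  // Property names as they appear in this thread.
  static final String GENERATOR_COUNT_MODE = "generate.count.mode";
  static final String PARTITION_MODE_KEY = "partition.url.mode";
  // Assumed literal values; check the real constants before relying on them.
  static final String COUNT_HOST = "host";
  static final String COUNT_DOMAIN = "domain";
  static final String PARTITION_HOST = "byHost";
  static final String PARTITION_DOMAIN = "byDomain";

  /**
   * Overwrites partition.url.mode from generate.count.mode, whatever
   * partition.url.mode was set to before, and returns the effective mode.
   */
  static String applyCountMode(Map<String, String> conf) {
    String mode = conf.getOrDefault(GENERATOR_COUNT_MODE, COUNT_HOST);
    String partitionMode;
    if (COUNT_DOMAIN.equalsIgnoreCase(mode)) {
      partitionMode = PARTITION_DOMAIN;
    } else {
      if (!COUNT_HOST.equalsIgnoreCase(mode)) {
        // Unknown values (including an IP mode) fall back to host counting.
        System.err.println("Unknown " + GENERATOR_COUNT_MODE + " '" + mode
            + "', using " + COUNT_HOST);
        conf.put(GENERATOR_COUNT_MODE, COUNT_HOST);
      }
      partitionMode = PARTITION_HOST;
    }
    conf.put(PARTITION_MODE_KEY, partitionMode);
    return partitionMode;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(PARTITION_MODE_KEY, "byIP");      // what the user asked for
    conf.put(GENERATOR_COUNT_MODE, "domain");  // what the generate phase honours
    System.out.println(applyCountMode(conf));  // prints "byDomain"
  }
}
```

The point of the sketch is only that, in the generate phase, whatever the user put in partition.url.mode gets overwritten, which is what the extended description would warn about.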

> Sorry for my bad English.
>
That's fine. I am not perfect either :) There was a typo in my reply: I
missed a few words, or maybe they got deleted accidentally. Correction in
bold:
"There might be some reason behind removing it, *and we must look into
it* before adding it back."

>
> Thanks
> lufeng
>
> On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil <[email protected]>wrote:
>
>> Hi Lufeng,
>>
>> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <[email protected]> wrote:
>>
>>> Hi Lewis
>>>
>>> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>>>
>>> As for GENERATOR_COUNT_VALUE_IP, I think we could patch GeneratorJob
>>> instead of deprecating it. The patch might look like this:
>>>
>>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_HOST);
>>> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_DOMAIN);
>>> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_IP);
>>> } else {
>>>   LOG.warn("Unknown generator.max.count mode '" + mode
>>>       + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
>>>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_HOST);
>>> }
>>>
>> The description of property "generate.count.mode" says the IP based
>> counting has been disabled in the newer Generator version. There might be
>> some reason behind removing it before adding it back. I am searching
>> for any relevant discussion(s) on @user / @dev or Jira about this. If
>> you find anything, do share.
>>
>>
>>
>>> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will
>>> never be set, even if we set the partition.url.mode property to byIP in
>>> nutch-default.xml. Maybe the partition.url.mode property should be removed
>>> from nutch-default.xml, because it depends on the value of
>>> GENERATOR_COUNT_MODE.
>>>
>>> What do you think?
>>>
>>
>> URL partitioning is done not only in the generate phase but in the fetch
>> phase too. The mode of the URLPartitioner is defined by the param
>> "partition.url.mode", which can be by host, domain or IP. This works out
>> well for the fetch phase, as it supports partitioning of URLs in all these
>> modes. For the generate phase, the mode of the URLPartitioner is governed by
>> the value of "generate.count.mode" (irrespective of the value of
>> "partition.url.mode").
>> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>>
>> [0] :
>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
>>
>>>
>>> Thanks,
>>> lufeng
>>>
>>>
>>>
>>> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil 
>>> <[email protected]>wrote:
>>>
>>>> Hey Lewis,
>>>>
>>>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> Following on from a discussion on user@ I dived into the GeneratorJob
>>>>> code and have the following general comment based on my observation...
>>>>> Usage of configuration options is really unstructured and loosely applied.
>>>>> This should not be the case. For example
>>>>>
>>>>> Observations
>>>>> ===========
>>>>>
>>>>> nutch-default.xml
>>>>> ---------------------
>>>>>  - generate.max.count property appears here but I cannot see for the
>>>>> life of me where it actually is used in the GeneratorJob, Mapper or 
>>>>> Reducer.
>>>>>
>>>>
>>>> Not sure if you are talking in terms of usage of the value of the param
>>>> in the code logic or practical application of the param for some use case.
>>>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT" and
>>>> later this is picked up by GeneratorReducer in its local variable
>>>> "maxcount", which is used in the reduce method. So I think it is being used in
>>>> the generate phase. To be honest, I have never faced a situation where I had to
>>>> use it but I think that it might be helpful for some class of (rare)
>>>> scenarios.
>>>>
>>>>>
>>>>> Unused in GeneratorJob
>>>>> --------------------------------
>>>>>  - GENERATOR_MIN_SCORE - seems not to be used
>>>>>  - GENERATOR_MAX_COUNT - seems not to be used
>>>>>
>>>>
>>>> You are right. These are used in 1.x but not in 2.x. Not sure if this
>>>> was intentionally left out of 2.x or was overlooked while 2.x was being
>>>> developed. Do you have any idea?
>>>>
>>>>>
>>>>> Missing in nutch-default.xml
>>>>> ------------------------------------
>>>>>  - generate.min.score - but used in GeneratorJob
>>>>>
>>>> Well, as per the earlier point, GeneratorJob just picks up this property and
>>>> stores it in a local variable. Later it isn't used by either map or reduce for
>>>> any processing.
>>>>
>>>>>  - generate.filter - set to true by default and available as a CLI
>>>>> override but should also be specified in nutch-default.xml
>>>>>  - generate.normalise - set to true by default and available as a CLI
>>>>> override but should also be specified in nutch-default.xml
>>>>>  - generate.topN - set to 2^63-1 by default and available as a CLI
>>>>> override but should also be specified in nutch-default.xml
>>>>>
>>>>> Suggestions to add
>>>>> --------------------------
>>>>>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this
>>>>> static element. I am not sure if it is used... I don't think it is.
>>>>>
>>>> It is not used. In my opinion, I would favor removal of such things.
>>>> There was some discussion going on over the user group to remove such
>>>> deprecated properties from nutch-default.xml to avoid confusion (see [1]).
>>>> The corresponding jira [2] was limited to the configs discussed over [1].
>>>> Maybe this discussion can be regarded as an extension/continuation of that
>>>> jira. What do you say?
>>>>
>>>>> Any comments on this please?
>>>>>
>>>>> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>> [1] :
>>>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>>>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>>>
>>>> Thanks,
>>>> Tejas Patil
>>>>
>>>
>>>
>>>
>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>
>> Thanks,
>> Tejas Patil
>>
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
