Hi Lewis,

We have not come to a conclusion on this topic.
Here is what I propose:
1. keep "generate.max.count"
2. GENERATOR_MIN_SCORE and GENERATOR_MAX_COUNT: once we confirm that they
were not retained in 2.x for some valid reason, we can safely remove
these params. They seem to do nothing meaningful.
3. generate.min.score : remove ?
4. generate.filter, generate.normalise, generate.topN : there is no
problem in keeping them, though we could also remove them.
5. GENERATOR_COUNT_VALUE_IP : ??

thanks,
Tejas Patil


On Wed, Feb 20, 2013 at 9:44 PM, Tejas Patil <[email protected]> wrote:

> Hi Lufeng,
>
> On Wed, Feb 20, 2013 at 9:19 PM, feng lu <[email protected]> wrote:
>
>> Hi Tejas
>>
>> Yes, you are right. I misread the description of the property
>> "generate.count.mode". I'm sorry; I also did not find any information
>> about why the IP-based counting mode of "generate.count.mode" was disabled.
>>
>> Yes, I see that the FetchEntryPartitioner class (which is built on
>> URLPartitioner) is used by FetcherJob. So, as you say, the setting of
>> "partition.url.mode" has no effect on the GeneratorJob.
>>
>> Do you think we can add a more detailed description to the property
>> "generate.count.mode"? For example:
>>
>> <property>
>>   <name>generate.count.mode</name>
>>   <value>host</value>
>>   <description>Determines how the URLs are counted for
>> generator.max.count.
>>   Default value is 'host' but can be 'domain'. Note that we do not count
>>   per IP in the new version of the Generator. In GeneratorJob this setting
>>   applies irrespective of the value of 'partition.url.mode'.
>>   </description>
>> </property>
>>
> +1. This will help the users.
>
>> Sorry for my bad English.
>>
> That's fine. I am not perfect either :) There was a typo in my reply. I
> missed a few words, or maybe they got deleted accidentally. Correction in
> bold:
> "There might be some reason behind removing it *and we must look into
> it* before adding it back."
>
>>
>> Thanks
>> lufeng
>>
>> On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil <[email protected]> wrote:
>>
>>> Hi Lufeng,
>>>
>>> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <[email protected]> wrote:
>>>
>>>> Hi Lewis
>>>>
>>>> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>>>>
>>>> As for GENERATOR_COUNT_VALUE_IP, I think we could add a patch to
>>>> GeneratorJob instead of deprecating it. The patch might look like this:
>>>>
>>>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>       URLPartitioner.PARTITION_MODE_DOMAIN);
>>>> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>       URLPartitioner.PARTITION_MODE_IP);
>>>> } else {
>>>>   LOG.warn("Unknown generator.max.count mode '" + mode
>>>>       + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
>>>>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>> }
>>>>
>>> The description of property "generate.count.mode" says the IP based
>>> counting has been disabled in the newer Generator version. There might be
>>> some reason behind removing it before adding it back. I am searching
>>> for any relevant discussion(s) on @user / @dev or Jira about this. If
>>> you find anything, do share.
>>>
>>>
>>>
>>>> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will
>>>> never be set, even if we set the partition.url.mode property to byIP in
>>>> nutch-default.xml. Maybe the partition.url.mode property could be removed
>>>> from nutch-default.xml, because it depends on the value of
>>>> GENERATOR_COUNT_MODE.
>>>>
>>>> What do you think?
>>>>
>>>
>>> URL partitioning is done not only in the generate phase but in the fetch
>>> phase too. The mode of the URLPartitioner is defined by the param
>>> "partition.url.mode", which can be by host, domain or IP. This works out
>>> well for the fetch phase, as it supports partitioning of URLs in all these
>>> modes. For the generate phase, the mode of the URLPartitioner is governed by
>>> the value of "generate.count.mode" (irrespective of the value of
>>> "partition.url.mode").
>>> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>>>
>>> [0] :
>>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
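To make the difference between the two phases concrete, here is a minimal sketch. It is not the GeneratorJob code itself. It uses a plain Map in place of the Hadoop Configuration, the class and method names are made up for illustration, and the string values "byHost", "byDomain", "byIP", "host" and "domain" (plus the "byHost" default) are assumptions about what the properties accept.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionModeSketch {

    // Fetch phase: the partition mode comes straight from partition.url.mode.
    // The "byHost" default is an assumption.
    public static String fetchPhaseMode(Map<String, String> conf) {
        return conf.getOrDefault("partition.url.mode", "byHost");
    }

    // Generate phase: the partition mode is derived from generate.count.mode
    // only; partition.url.mode is never consulted here.
    public static String generatePhaseMode(Map<String, String> conf) {
        String mode = conf.getOrDefault("generate.count.mode", "host");
        if ("domain".equalsIgnoreCase(mode)) {
            return "byDomain";
        }
        // "host" and any unrecognised value (including "ip") fall back to host.
        return "byHost";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("partition.url.mode", "byIP");
        conf.put("generate.count.mode", "domain");
        System.out.println("fetch phase:    " + fetchPhaseMode(conf));
        System.out.println("generate phase: " + generatePhaseMode(conf));
    }
}
```

With these settings the two phases end up partitioning differently, which is the behaviour being discussed in this thread.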
>>>
>>>>
>>>> Thanks,
>>>> lufeng
>>>>
>>>>
>>>>
>>>> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil <[email protected]> wrote:
>>>>
>>>>> Hey Lewis,
>>>>>
>>>>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Following on from a discussion on user@ I dived into the
>>>>>> GeneratorJob code and have the following general comment based on my
>>>>>> observation... Usage of configuration options is really unstructured and
>>>>>> loosely applied. This should not be the case. For example
>>>>>>
>>>>>> Observations
>>>>>> ===========
>>>>>>
>>>>>> nutch-default.xml
>>>>>> ---------------------
>>>>>>  - generate.max.count property appears here but I cannot see for the
>>>>>> life of me where it actually is used in the GeneratorJob, Mapper or 
>>>>>> Reducer.
>>>>>>
>>>>>
>>>>> Not sure if you are talking about usage of the param's value in the
>>>>> code logic or practical application of the param for some use case.
>>>>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT" and
>>>>> later this is picked up by GeneratorReducer in its local variable
>>>>> "maxcount", which is used in the reduce method. So I think that it is
>>>>> being used in the generate phase. To be honest, I have never faced a
>>>>> situation where I had to use it, but I think that it might be helpful for
>>>>> some class of (rare) scenarios.
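For anyone wondering what such a cap does in practice, here is a rough sketch of the idea. It is not the GeneratorReducer code. The class and method names are hypothetical, it only counts per host, it keeps URLs in input order rather than by score, and treating a negative value as "no limit" is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxCountSketch {

    // Hypothetical helper: keep at most maxCount URLs per host, in input order.
    // A negative maxCount is treated as "no limit" (assumption).
    public static List<String> capPerHost(List<String> urls, int maxCount) {
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs without a host part
            }
            // Increment this host's counter and read the new value.
            int count = seenPerHost.merge(host, 1, Integer::sum);
            if (maxCount < 0 || count <= maxCount) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://a.example/3", "http://b.example/1");
        // With a cap of 2, the third a.example URL is dropped.
        System.out.println(capPerHost(urls, 2));
    }
}
```

Counting per domain instead of per host would only change how the key is derived from the URL; the capping logic stays the same.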
>>>>>
>>>>>>
>>>>>> Unused in GeneratorJob
>>>>>> --------------------------------
>>>>>>  - GENERATOR_MIN_SCORE - seems not to be used
>>>>>>  - GENERATOR_MAX_COUNT - seems not to be used
>>>>>>
>>>>>
>>>>> You are right. These are used in 1.x but not in 2.x. Not sure if this
>>>>> was intentionally left out of 2.x or got missed through an
>>>>> oversight. Do you have any idea?
>>>>>
>>>>>>
>>>>>> Missing in nutch-default.xml
>>>>>> ------------------------------------
>>>>>>  - generate.min.score - but used in GeneratorJob
>>>>>>
>>>>> Well, as per the earlier point, GeneratorJob just picks up this property
>>>>> and stores it in a local variable. Later it is not used by either map or
>>>>> reduce for any processing.
>>>>>
>>>>>>  - generate.filter - set to true by default and available as a CLI
>>>>>> override but should also be specified in nutch-default.xml
>>>>>>  - generate.normalise - set to true by default and available as a CLI
>>>>>> override but should also be specified in nutch-default.xml
>>>>>>  - generate.topN - set to 2^63-1 by default and available as a CLI
>>>>>> override but should also be specified in nutch-default.xml
>>>>>>
>>>>>> Suggestions to add
>>>>>> --------------------------
>>>>>>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this
>>>>>> static element. I am not sure if it is used... I don't think it is.
>>>>>>
>>>>> It is not used. In my opinion, I would favor removal of such things.
>>>>> There was some discussion on the user group about removing such
>>>>> deprecated properties from nutch-default.xml to avoid confusion (see [1]).
>>>>> The corresponding Jira [2] was limited to the configs discussed in [1].
>>>>> Maybe this discussion can be regarded as an extension/continuation of
>>>>> that Jira. What do you say?
>>>>>
>>>>>> Any comments on this please?
>>>>>>
>>>>>> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>>
>>>>>
>>>>> [1] :
>>>>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>>>>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>>>>
>>>>> Thanks,
>>>>> Tejas Patil
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Don't Grow Old, Grow Up... :-)
>>>>
>>>
>>> Thanks,
>>> Tejas Patil
>>>
>>
>>
>>
>> --
>> Don't Grow Old, Grow Up... :-)
>>
>
>
