@Tejas +1

I think:

Keep Property
---------------------
-  generate.max.count: keep it because it is still used by GeneratorJob and GeneratorReducer.
-  GENERATOR_MAX_COUNT
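
To illustrate what the cap amounts to, here is a rough sketch. It is not the actual GeneratorReducer code: the class and method names (MaxCountSketch, capPerHost) are made up for illustration, counting is done per host only, and I am assuming that a negative generate.max.count means "no limit".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxCountSketch {

  /**
   * Keeps at most maxCount URLs per host, preserving input order.
   * Assumption (not taken from the Nutch source): a negative maxCount
   * means "no limit".
   */
  public static List<String> capPerHost(List<String> urls, long maxCount) {
    Map<String, Long> seenPerHost = new HashMap<>();
    List<String> selected = new ArrayList<>();
    for (String url : urls) {
      String host = URI.create(url).getHost();
      if (host == null) {
        continue; // skip URLs without a host part
      }
      // Increment and read the running count for this host.
      long seen = seenPerHost.merge(host, 1L, Long::sum);
      if (maxCount < 0 || seen <= maxCount) {
        selected.add(url);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://a.example/1", "http://a.example/2",
        "http://a.example/3", "http://b.example/1");
    // With a cap of 2, the third a.example URL is dropped.
    System.out.println(capPerHost(urls, 2));
  }
}
```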

Deprecate Property
------------------------------
- GENERATOR_MIN_SCORE
- GENERATOR_COUNT_VALUE_IP

Add in nutch-default.xml
-------------------------------------
- generate.min.score
- generate.filter
- generate.normalise
- generate.topN
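
As a rough sketch of the defaults these entries would document, assuming the values mentioned earlier in this thread (filter and normalise default to true, topN defaults to 2^63-1). java.util.Properties stands in for the Hadoop Configuration here, the class name is hypothetical, and the fallback of 0 for generate.min.score is my assumption since nobody in the thread states its default.

```java
import java.util.Properties;

public class GeneratorDefaultsSketch {

  // Property names as discussed in this thread.
  static final String FILTER_KEY = "generate.filter";
  static final String NORMALISE_KEY = "generate.normalise";
  static final String TOP_N_KEY = "generate.topN";
  static final String MIN_SCORE_KEY = "generate.min.score";

  public static boolean filter(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(FILTER_KEY, "true"));
  }

  public static boolean normalise(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(NORMALISE_KEY, "true"));
  }

  public static long topN(Properties conf) {
    // Long.MAX_VALUE is 2^63-1, i.e. effectively "no limit".
    return Long.parseLong(
        conf.getProperty(TOP_N_KEY, String.valueOf(Long.MAX_VALUE)));
  }

  public static float minScore(Properties conf) {
    // Assumption: 0 as the fallback; the thread does not state the default.
    return Float.parseFloat(conf.getProperty(MIN_SCORE_KEY, "0"));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(TOP_N_KEY, "1000"); // explicit override
    System.out.println("filter=" + filter(conf)
        + " normalise=" + normalise(conf)
        + " topN=" + topN(conf)
        + " minScore=" + minScore(conf));
  }
}
```

Having the entries in nutch-default.xml would make these fallbacks visible to users instead of leaving them buried in the job code.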

Thanks
lufeng


On Mon, Feb 25, 2013 at 3:44 AM, Tejas Patil <[email protected]>wrote:

> Hi Lewis,
>
> We have not come to a conclusion on this topic.
> Here is what I propose:
> 1. keep "generate.max.count"
> 2. GENERATOR_MIN_SCORE and GENERATOR_MAX_COUNT: once we know whether they
> were kept in 2.x for some valid reason, we can decide; if there is none, we
> can safely remove these params. They seem to do nothing meaningful.
> 3. generate.min.score : remove ?
> 4. generate.filter, generate.normalise, generate.topN : there is no
> problem in keeping them. We could even remove them.
> 5. GENERATOR_COUNT_VALUE_IP : ??
>
> thanks,
> Tejas Patil
>
>
> On Wed, Feb 20, 2013 at 9:44 PM, Tejas Patil <[email protected]>wrote:
>
>> Hi Lufeng,
>>
>> On Wed, Feb 20, 2013 at 9:19 PM, feng lu <[email protected]> wrote:
>>
>>> Hi Tejas
>>>
>>> Yes, you are right. I misread the description of the property
>>> "generate.count.mode". I'm sorry. I also did not find any information
>>> about why the IP-based counting mode of "generate.count.mode" was disabled.
>>>
>>> Yes, I see that the FetchEntryPartitioner class (a composition
>>> of URLPartitioner) is used by FetcherJob. So, as you say, the setting of
>>> "partition.url.mode" has no effect on the GeneratorJob.
>>>
>>> Do you think we can add a more detailed description to the property
>>> "generate.count.mode", such as:
>>>
>>> <property>
>>>   <name>generate.count.mode</name>
>>>   <value>host</value>
>>>   <description>Determines how the URLs are counted for
>>>   generate.max.count.
>>>   Default value is 'host' but can be 'domain'. Note that we do not count
>>>   per IP in the new version of the Generator. In GeneratorJob this applies
>>>   irrespective of the value of 'partition.url.mode'.
>>>   </description>
>>> </property>
>>>
>> +1. This will help the users.
>>
>>> Sorry for my bad English.
>>>
>> That's fine. I am not perfect either :) There was a typo in my reply. I
>> missed a few words, or maybe they accidentally got deleted. Correction in
>> bold:
>> "There might be some reason behind removing it *and we must look into it*
>> before adding it back."
>>
>>>
>>> Thanks
>>> lufeng
>>>
>>> On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil 
>>> <[email protected]>wrote:
>>>
>>>> Hi Lufeng,
>>>>
>>>> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <[email protected]> wrote:
>>>>
>>>>> Hi Lewis
>>>>>
>>>>> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>>>>>
>>>>> As for the GENERATOR_COUNT_VALUE_IP property, I think we can add a
>>>>> patch to GeneratorJob instead of deprecating it. The patch may look like
>>>>> this:
>>>>>
>>>>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>>> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_DOMAIN);
>>>>> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_IP);
>>>>> } else {
>>>>>   LOG.warn("Unknown generator.max.count mode '" + mode
>>>>>       + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
>>>>>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>>> }
>>>>>
>>>> The description of property "generate.count.mode" says the IP based
>>>> counting has been disabled in the newer Generator version. There might be
>>>> some reason behind removing it before adding it back. I am searching out
>>>> for any relevant discussion(s) over @user / @dev  or Jira about this. If
>>>> you find anything, do share.
>>>>
>>>>
>>>>
>>>>> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will
>>>>> never be set, even if we set the partition.url.mode property to byIP in
>>>>> nutch-default.xml. Maybe the partition.url.mode property should then be
>>>>> removed from nutch-default.xml, because it depends on the value of
>>>>> GENERATOR_COUNT_MODE.
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> URL partitioning is done not only in the generate phase, but in the
>>>> fetch phase too. The mode of the URLPartitioner is defined by the param
>>>> "partition.url.mode", which can be by host, domain or ip. This works out
>>>> well for the fetch phase as it supports partitioning of urls in all these
>>>> modes. For the generate phase, the mode of the URLPartitioner is governed
>>>> by the value of "generate.count.mode" (irrespective of the value of
>>>> "partition.url.mode").
>>>> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>>>>
>>>> [0] :
>>>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
>>>>
>>>>>
>>>>> Thanks,
>>>>> lufeng
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hey Lewis,
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Following on from a discussion on user@ I dived into the
>>>>>>> GeneratorJob code and have the following general comment based on my
>>>>>>> observation... Usage of configuration options is really unstructured and
>>>>>>> loosely applied. This should not be the case. For example
>>>>>>>
>>>>>>> Observations
>>>>>>> ===========
>>>>>>>
>>>>>>> nutch-default.xml
>>>>>>> ---------------------
>>>>>>>  - generate.max.count property appears here but I cannot see for the
>>>>>>> life of me where it actually is used in the GeneratorJob, Mapper or 
>>>>>>> Reducer.
>>>>>>>
>>>>>>
>>>>>> Not sure if you are talking about usage of the param's value in the
>>>>>> code logic or practical application of the param for some use case.
>>>>>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT"
>>>>>> and later this is picked up by GeneratorReducer in its local variable
>>>>>> "maxcount", which is used in the reduce method. So I think that it is
>>>>>> being used in the generate phase. To be honest, I have never faced a
>>>>>> situation where I had to use it, but I think that it might be helpful
>>>>>> for some class of (rare) scenarios.
>>>>>>
>>>>>>>
>>>>>>> Unused in GeneratorJob
>>>>>>> --------------------------------
>>>>>>>  - GENERATOR_MIN_SCORE - seems not to be used
>>>>>>>  - GENERATOR_MAX_COUNT - seems not to be used
>>>>>>>
>>>>>>
>>>>>> You are right. These are used in 1.x but not in 2.x. Not sure if this
>>>>>> is something that was intentionally left out in 2.x or was missed in
>>>>>> 2.x due to an oversight. Do you have any idea?
>>>>>>
>>>>>>>
>>>>>>> Missing in nutch-default.xml
>>>>>>> ------------------------------------
>>>>>>>  - generate.min.score - but used in GeneratorJob
>>>>>>>
>>>>>> Well, as per the earlier point, GeneratorJob just picks up this property
>>>>>> and stores it in its local variable. Later it is not used by either map
>>>>>> or reduce for any processing.
>>>>>>
>>>>>>>  - generate.filter - set to true by default and available as a CLI
>>>>>>> override but should also be specified in nutch-default.xml
>>>>>>>  - generate.normalise - set to true by default and available as a
>>>>>>> CLI override but should also be specified in nutch-default.xml
>>>>>>>  - generate.topN - set to 2^63-1 by default and available as a CLI
>>>>>>> override but should also be specified in nutch-default.xml
>>>>>>>
>>>>>>> Suggestions to add
>>>>>>> --------------------------
>>>>>>>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this
>>>>>>> static element. I am not sure if it is used... I don't think it is.
>>>>>>>
>>>>>> It is not used. In my opinion, I would favor removal of such things.
>>>>>> There was some discussion going on over the user group to remove such
>>>>>> deprecated properties from nutch-default.xml to avoid confusion (see
>>>>>> [1]). The corresponding jira [2] was limited to the configs discussed
>>>>>> over [1]. Maybe this discussion can be regarded as an
>>>>>> extension/continuation for that jira. What say ?
>>>>>>
>>>>>>> Any comments on this please?
>>>>>>>
>>>>>>> [0]
>>>>>>> http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>>
>>>>>>
>>>>>> [1] :
>>>>>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>>>>>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>>>>>
>>>>>> Thanks,
>>>>>> Tejas Patil
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Don't Grow Old, Grow Up... :-)
>>>>>
>>>>
>>>> Thanks,
>>>> Tejas Patil
>>>>
>>>
>>>
>>>
>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>
>>
>


-- 
Don't Grow Old, Grow Up... :-)
