@Tejas +1 I think:
Keep Property
---------------------
- generate.max.count: keep it because it is still used by GeneratorJob and the Reducer.
- GENERATOR_MAX_COUNT

Deprecate Property
------------------------------
- GENERATOR_MIN_SCORE
- GENERATOR_COUNT_VALUE_IP

Add in nutch-default.xml
-------------------------------------
- generate.min.score
- generate.filter
- generate.normalise
- generate.topN

Thanks
lufeng

On Mon, Feb 25, 2013 at 3:44 AM, Tejas Patil <[email protected]> wrote:

> Hi Lewis,
>
> We have not come to a conclusion on this topic.
> Here is what I propose:
> 1. keep "generate.max.count"
> 2. GENERATOR_MIN_SCORE and GENERATOR_MAX_COUNT: once we get to know
> whether they were kept back in 2.x for some valid reason, we can safely
> remove these params. These seem to do nothing meaningful.
> 3. generate.min.score : remove ?
> 4. generate.filter, generate.normalise, generate.topN : there is no
> problem in keeping them. We can even remove them.
> 5. GENERATOR_COUNT_VALUE_IP : ??
>
> thanks,
> Tejas Patil
>
>
> On Wed, Feb 20, 2013 at 9:44 PM, Tejas Patil <[email protected]> wrote:
>
>> Hi Lufeng,
>>
>> On Wed, Feb 20, 2013 at 9:19 PM, feng lu <[email protected]> wrote:
>>
>>> Hi Tejas
>>>
>>> Yes, you are right. I misread the description of the property
>>> "generate.count.mode". I'm so sorry. I also did not find any information
>>> about why the IP based counting mode of "generate.count.mode" was disabled.
>>>
>>> Yes, I see that the FetchEntryPartitioner class (combination
>>> of URLPartitioner) is used by FetcherJob. So, as you say, the setting of
>>> "partition.url.mode" has no effect on the GeneratorJob.
>>>
>>> Do you think we can add a more detailed description to the property
>>> "generate.count.mode", such as
>>>
>>> <property>
>>> <name>generate.count.mode</name>
>>> <value>host</value>
>>> <description>Determines how the URLs are counted for
>>> generator.max.count.
>>> Default value is 'host' but can be 'domain'.
>>> Note that we do not count
>>> per IP in the new version of the Generator. This is irrespective of
>>> the value of 'partition.url.mode' in GeneratorJob.
>>> </description>
>>> </property>
>>>
>> +1. This will help the users.
>>
>>> Sorry for my bad English.
>>>
>> That's fine. I am not perfect either :) There was a typo in my reply. I
>> missed a few words, or maybe they accidentally got deleted. Correction in
>> bold:
>> "There might be some reason behind removing it *and we must look into
>> it* before adding it back".
>>
>>>
>>> Thanks
>>> lufeng
>>>
>>> On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil <[email protected]> wrote:
>>>
>>>> Hi Lufeng,
>>>>
>>>> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <[email protected]> wrote:
>>>>
>>>>> Hi Lewis
>>>>>
>>>>> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>>>>>
>>>>> As for GENERATOR_COUNT_VALUE_IP, I think we can add a patch to
>>>>> GeneratorJob instead of deprecating it. The patch may look like this:
>>>>>
>>>>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>>> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_DOMAIN);
>>>>> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_IP);
>>>>> } else {
>>>>>   LOG.warn("Unknown generator.max.count mode '" + mode + "', using mode="
>>>>>       + GENERATOR_COUNT_VALUE_HOST);
>>>>>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>>> }
>>>>>
>>>> The description of property "generate.count.mode" says the IP based
>>>> counting has been disabled in the newer Generator version.
>>>> There might be
>>>> some reason behind removing it before adding it back. I am searching
>>>> for any relevant discussion(s) over @user / @dev or Jira about this. If
>>>> you find anything, do share.
>>>>
>>>>
>>>>> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will
>>>>> never be set, even if we set the partition.url.mode property to byIP in
>>>>> nutch-default.xml. Maybe the partition.url.mode property could then be
>>>>> removed from nutch-default.xml, because it depends on the value of
>>>>> GENERATOR_COUNT_MODE.
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> The url partitioning is done not only in the generate phase, but in the
>>>> fetch phase too. The mode of the URLPartitioner is defined by the param
>>>> "partition.url.mode" which can be by host, domain or ip. This works out
>>>> well for the fetch phase as it supports partitioning of urls in all these
>>>> modes. For the generate phase, the mode of the URLPartitioner is governed
>>>> by the value of "generate.count.mode" (irrespective of the value of
>>>> "partition.url.mode").
>>>> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>>>>
>>>> [0] :
>>>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
>>>>
>>>>>
>>>>> Thanks,
>>>>> lufeng
>>>>>
>>>>>
>>>>> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil <[email protected]> wrote:
>>>>>
>>>>>> Hey Lewis,
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Following on from a discussion on user@ I dived into the
>>>>>>> GeneratorJob code and have the following general comment based on my
>>>>>>> observation... Usage of configuration options is really unstructured and
>>>>>>> loosely applied. This should not be the case.
>>>>>>> For example
>>>>>>>
>>>>>>> Observations
>>>>>>> ===========
>>>>>>>
>>>>>>> nutch-default.xml
>>>>>>> ---------------------
>>>>>>> - generate.max.count property appears here but I cannot see for the
>>>>>>> life of me where it actually is used in the GeneratorJob, Mapper or
>>>>>>> Reducer.
>>>>>>>
>>>>>>
>>>>>> Not sure if you are talking in terms of usage of the value of the
>>>>>> param in the code logic or practical application of the param for some
>>>>>> use case.
>>>>>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT"
>>>>>> and later this is picked up by GeneratorReducer in its local variable
>>>>>> "maxcount" which is used in the reduce method. So I think that it is used
>>>>>> in the generate phase. To be honest, I have never faced a situation where
>>>>>> I had to use it but I think that it might be helpful for some class of
>>>>>> (rare) scenarios.
>>>>>>
>>>>>>>
>>>>>>> Unused in GeneratorJob
>>>>>>> --------------------------------
>>>>>>> - GENERATOR_MIN_SCORE - seems not to be used
>>>>>>> - GENERATOR_MAX_COUNT - seems not to be used
>>>>>>>
>>>>>>
>>>>>> You are right. These are used in 1.X but not in 2.X. Not sure if this
>>>>>> is something that was intentionally left out in 2.x or got missed in
>>>>>> 2.x due to an oversight. Do you have any idea ?
>>>>>>
>>>>>>>
>>>>>>> Missing in nutch-default.xml
>>>>>>> ------------------------------------
>>>>>>> - generate.min.score - but used in GeneratorJob
>>>>>>>
>>>>>> Well, as per the earlier point, GeneratorJob just picks up this property
>>>>>> and stores it in its local variable. Later it isn't used by either map or
>>>>>> reduce for any processing.
>>>>>>
>>>>>>> - generate.filter - set to true by default and available as a CLI
>>>>>>> override but should also be specified in nutch-default.xml
>>>>>>> - generate.normalise - set to true by default and available as a
>>>>>>> CLI override but should also be specified in nutch-default.xml
>>>>>>> - generate.topN - set to 2^63-1 by default and available as a CLI
>>>>>>> override but should also be specified in nutch-default.xml
>>>>>>>
>>>>>>> Suggestions to add
>>>>>>> --------------------------
>>>>>>> - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this
>>>>>>> static element. I am not sure if it is used... I don't think it is.
>>>>>>>
>>>>>> It is not used. In my opinion, I would favor removal of such things.
>>>>>> There was some discussion going on over the user group to remove such
>>>>>> deprecated properties from nutch-default.xml to avoid confusion (see
>>>>>> [1]).
>>>>>> The corresponding jira [2] was limited to the configs discussed over [1].
>>>>>> Maybe this discussion can be regarded as an extension/continuation for
>>>>>> that jira. What say ?
>>>>>>
>>>>>>> Any comments on this please?
>>>>>>>
>>>>>>> [0]
>>>>>>> http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>>
>>>>>>
>>>>>> [1] :
>>>>>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>>>>>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>>>>>
>>>>>> Thanks,
>>>>>> Tejas Patil
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Don't Grow Old, Grow Up... :-)
>>>>>
>>>>
>>>> Thanks,
>>>> Tejas Patil
>>>>
>>>
>>>
>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>
>
--
Don't Grow Old, Grow Up... :-)
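
P.S. For the "Add in nutch-default.xml" list, a rough sketch of what the four entries might look like. The defaults for generate.filter, generate.normalise and generate.topN are the ones Lewis mentioned earlier in the thread (true, true and 2^63-1). The generate.min.score default of 0 and all of the description texts are only my assumptions and would need checking against GeneratorJob before anything is committed.

```xml
<!-- Sketch only: description texts and the min.score default are assumptions. -->
<property>
  <name>generate.min.score</name>
  <value>0</value> <!-- assumed default, not confirmed in this thread -->
  <description>Assumed wording: minimum score an entry needs in order to be
  selected by the generator.
  </description>
</property>

<property>
  <name>generate.filter</name>
  <value>true</value>
  <description>Assumed wording: whether URL filters are applied during the
  generate phase. Can be overridden on the command line.
  </description>
</property>

<property>
  <name>generate.normalise</name>
  <value>true</value>
  <description>Assumed wording: whether URL normalizers are applied during the
  generate phase. Can be overridden on the command line.
  </description>
</property>

<property>
  <name>generate.topN</name>
  <value>9223372036854775807</value> <!-- 2^63-1, the default quoted in the thread -->
  <description>Assumed wording: maximum number of URLs to generate in one
  batch. Can be overridden on the command line.
  </description>
</property>
```

If generate.min.score turns out not to be read by the mapper or reducer at all, as Tejas suspects, that entry should be dropped from the list instead of added.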

