Hi Lufeng,

On Wed, Feb 20, 2013 at 9:19 PM, feng lu <[email protected]> wrote:
> Hi Tejas
>
> Yes, you are right. I misread the description of property
> "generate.count.mode". I'm so sorry, I also did not find any information
> about why the IP based counting mode of "generate.count.mode" was disabled.
>
> Yes, I see that the FetchEntryPartitioner class (combination
> of URLPartitioner) is used by FetcherJob. So as you say, the setting of
> "partition.url.mode" has no effect on the GeneratorJob.
>
> Do you think we can add a more detailed description to the property
> "generate.count.mode", such as
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generator.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator. This applies irrespective of the
>   value of 'partition.url.mode' in GeneratorJob.
>   </description>
> </property>
>

+1. This will help the users.

> Sorry for my bad English.
>

That's fine. I am not perfect either :)
There was a typo in my reply. I missed a few words, or maybe they got deleted
accidentally. Correction in bold:
"There might be some reason behind removing it *and we must look into it*
before adding it back".

>
> Thanks
> lufeng
>
> On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil <[email protected]> wrote:
>
>> Hi Lufeng,
>>
>> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <[email protected]> wrote:
>>
>>> Hi Lewis
>>>
>>> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>>>
>>> Regarding GENERATOR_COUNT_VALUE_IP, I think we can add a patch
>>> to GeneratorJob instead of deprecating it. The patch may look like this.
>>>
>>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_HOST);
>>> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_DOMAIN);
>>> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_IP);
>>> } else {
>>>   LOG.warn("Unknown generator.max.count mode '" + mode + "', using mode="
>>>       + GENERATOR_COUNT_VALUE_HOST);
>>>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>       URLPartitioner.PARTITION_MODE_HOST);
>>> }
>>>
>>
>> The description of property "generate.count.mode" says the IP based
>> counting has been disabled in the newer Generator version. There might be
>> some reason behind removing it before adding it back. I am searching
>> for any relevant discussion(s) over @user / @dev or Jira about this. If
>> you find anything, do share.
>>
>>> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will
>>> never be set, even if we set the partition.url.mode property to byIP in
>>> nutch-default.xml. Maybe the partition.url.mode property should be removed
>>> from nutch-default.xml, because it depends on the value of
>>> GENERATOR_COUNT_MODE.
>>>
>>> What do you think?
>>>
>>
>> The url partitioning is done not only in generate phase, but fetch phase
>> too. The mode of the URLPartitioner is defined by the param
>> "partition.url.mode" which can be by host, domain or ip. This works out
>> well for fetch phase as it supports partitioning of urls in all these
>> modes. For generate phase, the mode of the URLPartitioner is governed by
>> the value of "generate.count.mode" (irrespective of the value of
>> "partition.url.mode").
>> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>>
>> [0] :
>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
>>
>>>
>>> Thanks,
>>> lufeng
>>>
>>> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil
>>> <[email protected]> wrote:
>>>
>>>> Hey Lewis,
>>>>
>>>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> Following on from a discussion on user@ I dived into the GeneratorJob
>>>>> code and have the following general comment based on my observation...
>>>>> Usage of configuration options is really unstructured and loosely applied.
>>>>> This should not be the case. For example
>>>>>
>>>>> Observations
>>>>> ===========
>>>>>
>>>>> nutch-default.xml
>>>>> ---------------------
>>>>> - generate.max.count property appears here but I cannot see for the
>>>>> life of me where it actually is used in the GeneratorJob, Mapper or
>>>>> Reducer.
>>>>>
>>>>
>>>> Not sure if you are talking in terms of usage of the value of the param
>>>> in the code logic or practical application of the param for some use case.
>>>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT" and
>>>> later this is picked up by GeneratorReducer in its local variable
>>>> "maxcount" which is used in the reduce method. So I think that it is being
>>>> used in the generate phase. To be honest, I have never faced a situation
>>>> where I had to use it but I think that it might be helpful for some class
>>>> of (rare) scenarios.
>>>>
>>>>>
>>>>> Unused in GeneratorJob
>>>>> --------------------------------
>>>>> - GENERATOR_MIN_SCORE - seems not to be used
>>>>> - GENERATOR_MAX_COUNT - seems not to be used
>>>>>
>>>>
>>>> You are right. These are used in 1.X but not in 2.X. Not sure if this
>>>> is something that was intentionally left out in 2.x or got missed in 2.x
>>>> through an oversight. Do you have any idea ?
>>>>
>>>>>
>>>>> Missing in nutch-default.xml
>>>>> ------------------------------------
>>>>> - generate.min.score - but used in GeneratorJob
>>>>>
>>>> Well, as per the earlier point, GeneratorJob just picks up this property
>>>> and stores it in a local variable. Later it is not used by either map or
>>>> reduce for any processing.
>>>>
>>>>> - generate.filter - set to true by default and available as a CLI
>>>>> override but should also be specified in nutch-default.xml
>>>>> - generate.normalise - set to true by default and available as a CLI
>>>>> override but should also be specified in nutch-default.xml
>>>>> - generate.topN - set to 2^63-1 by default and available as a CLI
>>>>> override but should also be specified in nutch-default.xml
>>>>>
>>>>> Suggestions to add
>>>>> --------------------------
>>>>> - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this
>>>>> static element. I am not sure if it is used... I don't think it is.
>>>>>
>>>> It is not used. In my opinion, I would favor removal of such things.
>>>> There was some discussion going on over the user group to remove such
>>>> deprecated properties from nutch-default.xml to avoid confusion (see [1]).
>>>> The corresponding jira [2] was limited to the configs discussed over [1].
>>>> Maybe this discussion can be regarded as an extension/continuation for that
>>>> jira. What say ?
>>>>
>>>>> Any comments on this please?
>>>>>
>>>>> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>> [1] :
>>>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>>>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>>>
>>>> Thanks,
>>>> Tejas Patil
>>>>
>>>
>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>
>> Thanks,
>> Tejas Patil
>>
>
> --
> Don't Grow Old, Grow Up... :-)
>
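
For anyone reading this thread later, here is a minimal, self-contained sketch of the mode handling discussed above, including the IP branch from lufeng's proposed patch. It is not Nutch code: a plain Map stands in for the Hadoop Configuration, the class name CountModeSketch and the method applyCountMode are hypothetical, and the literal values "ip", "byHost" and "byDomain" are assumptions (only 'host', 'domain' and "byIP" are spelled out in the thread). It only illustrates the point that, in the generate phase, "generate.count.mode" overrides whatever "partition.url.mode" was set to.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a plain Map stands in for Hadoop's Configuration. */
public class CountModeSketch {
    // Property names as quoted in the thread.
    static final String GENERATOR_COUNT_MODE = "generate.count.mode";
    static final String PARTITION_MODE_KEY = "partition.url.mode";

    // Assumed literal values; the thread only spells out 'host', 'domain' and "byIP".
    static final String COUNT_HOST = "host";
    static final String COUNT_DOMAIN = "domain";
    static final String COUNT_IP = "ip";
    static final String PARTITION_HOST = "byHost";
    static final String PARTITION_DOMAIN = "byDomain";
    static final String PARTITION_IP = "byIP";

    /** Overwrites partition.url.mode based on generate.count.mode. */
    static void applyCountMode(Map<String, String> conf) {
        String mode = conf.getOrDefault(GENERATOR_COUNT_MODE, COUNT_HOST);
        if (COUNT_HOST.equalsIgnoreCase(mode)) {
            conf.put(PARTITION_MODE_KEY, PARTITION_HOST);
        } else if (COUNT_DOMAIN.equalsIgnoreCase(mode)) {
            conf.put(PARTITION_MODE_KEY, PARTITION_DOMAIN);
        } else if (COUNT_IP.equalsIgnoreCase(mode)) {
            // Branch added by the proposed patch.
            conf.put(PARTITION_MODE_KEY, PARTITION_IP);
        } else {
            // Unknown value: warn and fall back to host for both properties.
            System.err.println("Unknown " + GENERATOR_COUNT_MODE + " value '" + mode
                    + "', using " + COUNT_HOST);
            conf.put(GENERATOR_COUNT_MODE, COUNT_HOST);
            conf.put(PARTITION_MODE_KEY, PARTITION_HOST);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PARTITION_MODE_KEY, PARTITION_IP);    // user asked for byIP partitioning
        conf.put(GENERATOR_COUNT_MODE, COUNT_DOMAIN);  // but counting is per domain
        applyCountMode(conf);
        System.out.println(conf.get(PARTITION_MODE_KEY)); // prints byDomain
    }
}
```

Without the IP branch, a count mode of "ip" would fall through to the warning and be reset to host, which is the behaviour the current description of "generate.count.mode" implies.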

