I agree it is misleading at first.

2010/1/9 Kumar Krishnasami <kumara...@vembu.com>:
> Thanks, MilleBii. That explains it. All the docs I came across mentioned
> something like "-depth /depth/ indicates the link depth from the root page
> that should be crawled" (from
> http://lucene.apache.org/nutch/tutorial8.html).
>
> MilleBii wrote:
>> The depth argument is only used for the crawl command and is basically
>> the number of run cycles: crawl/fetch/update/index.
>>
>> 2010/1/8, Mischa Tuffield <mischa.tuffi...@garlik.com>:
>>
>>> Hi Kumar,
>>>
>>> I'm happy that was of use to you. Sadly I have no feel for what the
>>> "depth" argument does; I don't tend to ever use it. I tend to use
>>> Nutch's more specific commands: inject, generate, fetch, updatedb,
>>> merge, etc ...
>>>
>>> Perhaps someone else could shed light on the crawl command.
>>>
>>> Regards, and happy new year!
>>>
>>> Mischa
>>>
>>> On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote:
>>>
>>>> Thanks, Mischa. That worked!!
>>>>
>>>> So, it looks like once this config property is set, crawl ignores the
>>>> 'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never
>>>> crawl any of the outlinks. Is that correct?
>>>>
>>>> Regards,
>>>> Kumar.
>>>>
>>>> Mischa Tuffield wrote:
>>>>
>>>>> Hello Kumar,
>>>>>
>>>>> There is a config property you can set in conf/nutch-site.xml, as
>>>>> follows:
>>>>>
>>>>> <property>
>>>>>   <name>db.max.outlinks.per.page</name>
>>>>>   <value>0</value>
>>>>>   <description>The maximum number of outlinks that we'll process for
>>>>>   a page. If this value is nonnegative (>=0), at most
>>>>>   db.max.outlinks.per.page outlinks will be processed for a page;
>>>>>   otherwise, all outlinks will be processed.
>>>>>   </description>
>>>>> </property>
>>>>>
>>>>> This will force Nutch to only fetch items of depth "0", i.e. it won't
>>>>> attempt to follow any of the outlinks from pages you tell it to go
>>>>> and fetch.
>>>>>
>>>>> Regards,
>>>>> Mischa
>>>>>
>>>>> On 8 Jan 2010, at 10:59, Kumar Krishnasami wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a newbie to Nutch and have just started looking at it. I have a
>>>>>> requirement to crawl and index only the URLs that are specified
>>>>>> under the urls folder. I do not want Nutch to crawl to any depth
>>>>>> beyond the ones that are listed in the urls folder.
>>>>>>
>>>>>> Can I accomplish this by setting the depth argument for 'crawl' to
>>>>>> "0"?
>>>>>>
>>>>>> If I set the depth to 0, I get a message that says "No URLs to fetch
>>>>>> - check your seed list and URL filters."
>>>>>>
>>>>>> Any help will be greatly appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Kumar.
>>>>>
>>>>> ___________________________________
>>>>> Mischa Tuffield
>>>>> Email: mischa.tuffi...@garlik.com
>>>>> Homepage - http://mmt.me.uk/
>>>>> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
>>>>> +44(0)20 8973 2465 http://www.garlik.com/
>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey,
>>>>> KT10 9AD

-- 
-MilleBii-
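MilleBii's point that depth is the number of generate/fetch/update cycles (not a per-page link distance) can be sketched as a toy model. This is an illustration of the described behaviour only, not Nutch's actual code; the function and variable names are made up:

```python
def crawl(seeds, link_graph, depth):
    """Toy model of the crawl command: 'depth' counts
    generate/fetch/update cycles. link_graph maps each URL to its
    outlinks. Returns the set of fetched URLs."""
    crawldb = set(seeds)           # known URLs (the injected seeds)
    fetched = set()
    for _ in range(depth):         # one generate/fetch/update cycle
        batch = crawldb - fetched  # generate: pick unfetched URLs
        if not batch:
            break
        fetched |= batch           # fetch the batch
        for url in batch:          # updatedb: add discovered outlinks
            crawldb.update(link_graph.get(url, []))
    return fetched

graph = {"seed": ["a", "b"], "a": ["c"]}
print(sorted(crawl(["seed"], graph, 1)))  # ['seed'] - seeds only
print(sorted(crawl(["seed"], graph, 2)))  # ['a', 'b', 'seed']
```

With depth 1 only the seed list is fetched; each extra cycle fetches whatever the previous cycle discovered, which is why the tutorial's "link depth from the root page" wording is roughly right in effect but misleading about the mechanism.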
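Similarly, the db.max.outlinks.per.page rule quoted in Mischa's mail can be modelled in a few lines. Again this is a sketch of the semantics in the property's description, not Nutch source code, and the helper name is hypothetical:

```python
def process_outlinks(outlinks, max_outlinks_per_page):
    """Toy model of db.max.outlinks.per.page: a non-negative limit keeps
    at most that many outlinks per page; a negative value means no
    limit (all outlinks are processed)."""
    if max_outlinks_per_page >= 0:
        return outlinks[:max_outlinks_per_page]
    return list(outlinks)

links = ["http://a/", "http://b/", "http://c/"]
print(process_outlinks(links, 0))   # [] - value 0 drops every outlink
print(process_outlinks(links, 2))   # keeps the first two
print(process_outlinks(links, -1))  # negative: keeps all three
```

Setting the value to 0 is what makes the crawl stay on the seed URLs regardless of the depth argument: every cycle after the first generates an empty fetch list because no outlinks ever enter the crawl database.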