Hi Kumar,

I am happy that was of use to you. Sadly I have no feel for what the "depth" argument does; I never use it. I tend to use nutch's more specific commands: inject, generate, fetch, updatedb, merge, etc.
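A rough sketch of a single round with those commands, which should fetch only the injected seed URLs because no second generate/fetch round is run. The function name, the NUTCH variable and the two directory arguments are illustrative and not part of Nutch. The argument order follows the Nutch 1.x tutorial and may differ in other versions, and some setups need a separate parse step after fetch.

```shell
# Path to the nutch launcher script (assumed location; override as needed).
NUTCH="${NUTCH:-bin/nutch}"

# Run exactly one inject/generate/fetch/updatedb round.
#   $1 = crawl directory (hypothetical layout: $1/crawldb, $1/segments)
#   $2 = directory holding the seed URL list
crawl_seeds_once() {
    crawl_dir="$1"
    seed_dir="$2"

    # Load the seed URLs into the crawl database.
    "$NUTCH" inject "$crawl_dir/crawldb" "$seed_dir" || return 1

    # Create a new segment listing the URLs that are due for fetching.
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1

    # Segment directories are named by timestamp, so the last one is the newest.
    segment=$(ls -d "$crawl_dir"/segments/* | tail -n 1)

    # Fetch that segment, then fold the results back into the crawl database.
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
}
```

Called as `crawl_seeds_once crawl urls` against a fresh crawl directory. Outlinks discovered during the fetch are recorded by updatedb but are not fetched unless another generate/fetch round is run.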
Perhaps someone else could shed light on the crawl command.

Regards, and happy new year!

Mischa

On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote:

> Thanks, Mischa. That worked!!
>
> So, it looks like once this config property is set, crawl ignores the 'depth'
> argument. Even if I set 'depth' to 2, 3 etc., it will never crawl any of the
> outlinks. Is that correct?
>
> Regards,
> Kumar.
>
> Mischa Tuffield wrote:
>> Hello Kumar,
>> There is a config property you can set in conf/nutch-site.xml, as follows:
>> <!--
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>0</value>
>>   <description>The maximum number of outlinks that we'll process for a page.
>>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>>   will be processed for a page; otherwise, all outlinks will be processed.
>>   </description>
>> </property>
>> -->
>> This will force nutch to only fetch items of depth "0", i.e. it won't attempt
>> to follow any of the outlinks from pages you tell it to go and fetch.
>>
>> Regards,
>> Mischa
>> On 8 Jan 2010, at 10:59, Kumar Krishnasami wrote:
>>
>>> Hi,
>>>
>>> I am a newbie to nutch. Just started looking at it. I have a requirement to
>>> crawl and index only urls that are specified under the urls folder. I do
>>> not want nutch to crawl to any depth beyond the ones that are listed in the
>>> urls folder.
>>>
>>> Can I accomplish this by setting the depth argument for 'crawl' to "0"?
>>>
>>> If I set the depth to 0, I get a message that says "No URLs to fetch -
>>> check your seed list and URL filters.".
>>>
>>> Any help will be greatly appreciated.
>>>
>>> Thanks,
>>> Kumar.
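For reference, a sketch of how the property quoted in this thread might sit in conf/nutch-site.xml once the `<!-- ... -->` comment markers are removed (left in place, an XML parser skips everything between them). The surrounding `<configuration>` element is assumed here as the root of the file; check it against your own copy. The description text is shortened.

```xml
<configuration>
  <!-- Value 0: process no outlinks from any fetched page. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>0</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.
    </description>
  </property>
</configuration>
```

Going by the description, a negative value would process all outlinks again.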
___________________________________
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD