Hi Kumar, 

I'm glad that was of use to you. Sadly I have no feel for what the "depth" 
argument does. I never really use it; I tend to use Nutch's more specific 
commands instead: inject, generate, fetch, updatedb, merge, etc.

Perhaps someone else could shed light on the crawl command. 
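
For anyone who wants to reason about it in the meantime, here is a rough toy 
model in Python of how the two settings might interact. It assumes that "depth" 
is the number of generate/fetch/update rounds the crawl command runs, and that 
db.max.outlinks.per.page caps how many outlinks of each fetched page are added 
to the crawldb. The link graph, the function name toy_crawl and its parameters 
are all made up for illustration; this is not Nutch code.

```python
def toy_crawl(seeds, links, depth, max_outlinks=-1):
    """Return the URLs fetched after `depth` rounds (toy model, not Nutch).

    max_outlinks < 0 means "no limit", mirroring the property description.
    """
    known = list(seeds)   # stand-in for the crawldb
    fetched = []          # URLs fetched so far, in order
    for _ in range(depth):
        # "generate": everything known but not yet fetched
        todo = [url for url in known if url not in fetched]
        if not todo:
            break  # nothing left to fetch, stop early
        for url in todo:
            fetched.append(url)  # "fetch"
            outlinks = links.get(url, [])
            if max_outlinks >= 0:
                outlinks = outlinks[:max_outlinks]  # cap outlinks per page
            for out in outlinks:  # "updatedb": record newly found URLs
                if out not in known:
                    known.append(out)
    return fetched


# Hypothetical link graph: page "a" links to "b" and "c", "b" links to "d".
links = {"a": ["b", "c"], "b": ["d"]}

print(toy_crawl(["a"], links, depth=0))                  # no rounds at all
print(toy_crawl(["a"], links, depth=1))                  # seeds only
print(toy_crawl(["a"], links, depth=3))                  # follows outlinks
print(toy_crawl(["a"], links, depth=3, max_outlinks=0))  # seeds only
```

In this model a depth of 0 runs no rounds, so nothing is fetched, which would be 
consistent with the "No URLs to fetch" message. With the outlink limit set to 0, 
any depth of 1 or more fetches only the seed URLs, so the depth value stops 
mattering.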

Regards, and happy new year!

Mischa

On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote:

> Thanks, Mischa. That worked!!
> 
> So, it looks like once this config property is set, crawl ignores the 'depth' 
> argument. Even if I set 'depth' to 2, 3 etc., it will never crawl any of the 
> outlinks. Is that correct?
> 
> Regards,
> Kumar.
> 
> Mischa Tuffield wrote:
>> Hello Kumar, 
>> There is a config property you can set in conf/nutch-site.xml, as follows:
>> <property>
>>  <name>db.max.outlinks.per.page</name>
>>  <value>0</value>
>>  <description>The maximum number of outlinks that we'll process for a page.
>>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>>  will be processed for a page; otherwise, all outlinks will be processed.
>>  </description>
>> </property>
>> This will force Nutch to only fetch items of depth "0", i.e. it won't attempt 
>> to follow any of the outlinks from the pages you tell it to fetch.
>> 
>> Regards, 
>> Mischa
>> 
>> On 8 Jan 2010, at 10:59, Kumar Krishnasami wrote:
>> 
>>> Hi,
>>> 
>>> I am a newbie to Nutch. Just started looking at it. I have a requirement to 
>>> crawl and index only urls that are specified under the urls folder. I do 
>>> not want nutch to crawl to any depth beyond the ones that are listed in the 
>>> urls folder.
>>> 
>>> Can I accomplish this by setting the depth argument for 'crawl' to "0"?
>>> 
>>> If I set the depth to 0, I get a message that says "No URLs to fetch - 
>>> check your seed list and URL filters.".
>>> 
>>> Any help will be greatly appreciated.
>>> 
>>> Thanks,
>>> Kumar.
>> 
>> ___________________________________
>> Mischa Tuffield
>> Email: mischa.tuffi...@garlik.com
>> Homepage - http://mmt.me.uk/
>> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
>> +44(0)20 8973 2465  http://www.garlik.com/
>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>> 
> 

