The depth argument is only used by the crawl command; it is basically the
number of generate/fetch/update cycles that the command runs.
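
To make the interaction between depth and the outlink limit concrete, here is a toy simulation in Python. It is only a sketch and not Nutch code: the function name simulate_crawl, the link_graph dictionary and the single-letter URLs are all made up for illustration. It assumes that each round fetches every known URL that has not been fetched yet, and it truncates each page's outlinks the way the db.max.outlinks.per.page description quoted below says (a nonnegative value is a cap, a negative value means no cap).

```python
def simulate_crawl(seeds, link_graph, depth, max_outlinks=-1):
    """Toy model of a depth-limited crawl; hypothetical, not Nutch's implementation."""
    known = set(seeds)   # stands in for the crawl database after injecting the seeds
    fetched = []         # URLs in the order they were fetched
    for _ in range(depth):
        # "generate": select known URLs that have not been fetched yet
        to_fetch = sorted(u for u in known if u not in fetched)
        if not to_fetch:
            break        # nothing left to fetch, stop early
        for url in to_fetch:
            # "fetch": record the page and read its outlinks
            fetched.append(url)
            outlinks = link_graph.get(url, [])
            if max_outlinks >= 0:
                # a nonnegative limit caps the outlinks kept per page
                outlinks = outlinks[:max_outlinks]
            # "update": newly discovered URLs become candidates for the next round
            known.update(outlinks)
    return fetched


# Made-up link graph: page "a" links to "b" and "c", page "b" links to "d".
graph = {"a": ["b", "c"], "b": ["d"]}
print(simulate_crawl(["a"], graph, depth=3))
print(simulate_crawl(["a"], graph, depth=3, max_outlinks=0))
```

In this model, a limit of 0 means the seed page is fetched but none of its outlinks are ever recorded, so later rounds have nothing new to fetch no matter how large depth is.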

2010/1/8, Mischa Tuffield <mischa.tuffi...@garlik.com>:
> Hi Kumar,
>
> I am happy that was of use to you. Sadly I have no feel for what the
> "depth" argument does, as I never tend to use it. I tend to use nutch's
> more specific commands: inject, generate, fetch, updatedb, merge, etc.
>
> Perhaps someone else could shed light on the crawl command.
>
> Regards, and happy new year!
>
> Mischa
> On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote:
>
>> Thanks, Mischa. That worked!!
>>
>> So, it looks like once this config property is set, crawl ignores the
>> 'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never crawl
>> any of the outlinks. Is that correct?
>>
>> Regards,
>> Kumar.
>>
>> Mischa Tuffield wrote:
>>> Hello Kumar,
>>> There is a config property you can set in conf/nutch-site.xml, as follows:
>>> <!--
>>> <property>
>>>  <name>db.max.outlinks.per.page</name>
>>>  <value>0</value>
>>>  <description>The maximum number of outlinks that we'll process for a page.
>>>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>>>  will be processed for a page; otherwise, all outlinks will be processed.
>>>  </description>
>>> </property>
>>>              -->
>>> This will force nutch to only fetch items of depth "0", i.e. it won't
>>> attempt to follow any of the outlinks from pages you tell it to go and
>>> fetch.
>>>
>>> Regards,
>>> Mischa
>>> On 8 Jan 2010, at 10:59, Kumar Krishnasami wrote:
>>>
>>>> Hi,
>>>>
>>>> I am a newbie to nutch and have just started looking at it. I have a requirement to
>>>> crawl and index only urls that are specified under the urls folder. I do
>>>> not want nutch to crawl to any depth beyond the ones that are listed in
>>>> the urls folder.
>>>>
>>>> Can I accomplish this by setting the depth argument for 'crawl' to "0"?
>>>>
>>>> If I set the depth to 0, I get a message that says "No URLs to fetch -
>>>> check your seed list and URL filters.".
>>>>
>>>> Any help will be greatly appreciated.
>>>>
>>>> Thanks,
>>>> Kumar.
>>>
>>> ___________________________________
>>> Mischa Tuffield
>>> Email: mischa.tuffi...@garlik.com
>>> Homepage - http://mmt.me.uk/
>>> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
>>> +44(0)20 8973 2465  http://www.garlik.com/
>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>>
>>
>


-- 
-MilleBii-
