I agree it is misleading at first.

2010/1/9 Kumar Krishnasami <kumara...@vembu.com>:
> Thanks, MilleBii. That explains it. All the docs I came across mentioned
> something like "-depth /depth/ indicates the link depth from the root page
> that should be crawled" (from
> http://lucene.apache.org/nutch/tutorial8.html).
>
> MilleBii wrote:
>> The depth argument is only used for the crawl command and is basically
>> the number of run cycles: crawl/fetch/update/index.
>>
>> 2010/1/8, Mischa Tuffield <mischa.tuffi...@garlik.com>:
>>
>>> Hi Kumar,
>>>
>>> I'm happy that was of use to you. Sadly I have no feel for what the
>>> "depth" argument does; I don't tend to ever use it. I tend to use
>>> Nutch's more specific commands: inject, generate, fetch, updatedb,
>>> merge, etc ...
>>>
>>> Perhaps someone else could shed light on the crawl command.
>>>
>>> Regards, and happy new year!
>>>
>>> Mischa
>>>
>>> On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote:
>>>
>>>> Thanks, Mischa. That worked!!
>>>>
>>>> So, it looks like once this config property is set, crawl ignores the
>>>> 'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never
>>>> crawl any of the outlinks. Is that correct?
>>>>
>>>> Regards,
>>>> Kumar.
>>>>
>>>> Mischa Tuffield wrote:
>>>>
>>>>> Hello Kumar,
>>>>>
>>>>> There is a config property you can set in conf/nutch-site.xml, as
>>>>> follows:
>>>>>
>>>>> <property>
>>>>>   <name>db.max.outlinks.per.page</name>
>>>>>   <value>0</value>
>>>>>   <description>The maximum number of outlinks that we'll process for
>>>>>   a page. If this value is nonnegative (>=0), at most
>>>>>   db.max.outlinks.per.page outlinks will be processed for a page;
>>>>>   otherwise, all outlinks will be processed.
>>>>>   </description>
>>>>> </property>
>>>>>
>>>>> This will force Nutch to only fetch items of depth "0", i.e. it won't
>>>>> attempt to follow any of the outlinks from pages you tell it to go
>>>>> and fetch.
>>>>>
>>>>> Regards,
>>>>> Mischa
>>>>>
>>>>> On 8 Jan 2010, at 10:59, Kumar Krishnasami wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a newbie to Nutch and have just started looking at it. I have a
>>>>>> requirement to crawl and index only the URLs that are specified
>>>>>> under the urls folder. I do not want Nutch to crawl to any depth
>>>>>> beyond the ones that are listed in the urls folder.
>>>>>>
>>>>>> Can I accomplish this by setting the depth argument for 'crawl' to
>>>>>> "0"?
>>>>>>
>>>>>> If I set the depth to 0, I get a message that says "No URLs to fetch
>>>>>> - check your seed list and URL filters."
>>>>>>
>>>>>> Any help will be greatly appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Kumar.
>>>>>
>>>>> ___________________________________
>>>>> Mischa Tuffield
>>>>> Email: mischa.tuffi...@garlik.com
>>>>> Homepage - http://mmt.me.uk/
>>>>> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
>>>>> +44(0)20 8973 2465 http://www.garlik.com/
>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey,
>>>>> KT10 9AD

-- 
-MilleBii-
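MilleBii's point that depth is the number of generate/fetch/update cycles (not a per-page link distance) can be sketched as a toy model. This is an illustration of the described behaviour only, not Nutch's actual code; the function and variable names are made up:

```python
def crawl(seeds, link_graph, depth):
    """Toy model of the crawl command: 'depth' counts
    generate/fetch/update cycles. link_graph maps each URL to its
    outlinks. Returns the set of fetched URLs."""
    crawldb = set(seeds)           # known URLs (the injected seeds)
    fetched = set()
    for _ in range(depth):         # one generate/fetch/update cycle
        batch = crawldb - fetched  # generate: pick unfetched URLs
        if not batch:
            break
        fetched |= batch           # fetch the batch
        for url in batch:          # updatedb: add discovered outlinks
            crawldb.update(link_graph.get(url, []))
    return fetched

graph = {"seed": ["a", "b"], "a": ["c"]}
print(sorted(crawl(["seed"], graph, 1)))  # ['seed'] - seeds only
print(sorted(crawl(["seed"], graph, 2)))  # ['a', 'b', 'seed']
```

With depth 1 only the seed list is fetched; each extra cycle fetches whatever the previous cycle discovered, which is why the tutorial's "link depth from the root page" wording is roughly right in effect but misleading about the mechanism.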
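Similarly, the db.max.outlinks.per.page rule quoted in Mischa's mail can be modelled in a few lines. Again this is a sketch of the semantics in the property's description, not Nutch source code, and the helper name is hypothetical:

```python
def process_outlinks(outlinks, max_outlinks_per_page):
    """Toy model of db.max.outlinks.per.page: a non-negative limit keeps
    at most that many outlinks per page; a negative value means no
    limit (all outlinks are processed)."""
    if max_outlinks_per_page >= 0:
        return outlinks[:max_outlinks_per_page]
    return list(outlinks)

links = ["http://a/", "http://b/", "http://c/"]
print(process_outlinks(links, 0))   # [] - value 0 drops every outlink
print(process_outlinks(links, 2))   # keeps the first two
print(process_outlinks(links, -1))  # negative: keeps all three
```

Setting the value to 0 is what makes the crawl stay on the seed URLs regardless of the depth argument: every cycle after the first generates an empty fetch list because no outlinks ever enter the crawl database.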