The crawler has stopped fetching because all URLs have already been fetched;
there are no unfetched URLs left. Do you expect more URLs to be fetched?

Either you need more seed URLs or you need to change your URL filters.
The default Nutch URL filter configuration excludes the deep web, i.e. every
URL with a query part (?).
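For reference, the rule that drops query URLs sits in conf/crawl-urlfilter.txt
or conf/regex-urlfilter.txt (depending on which urlfilter plugin is active).
A minimal sketch, assuming the stock configuration (your file may differ):

  # skip URLs containing characters that usually mark queries, sessions etc.
  -[?*!@=]

  # to also crawl URLs with a query part, narrow the character class, e.g.:
  # -[*!@]

If you would rather add seeds, put them in a plain-text file (one URL per
line) in a seed directory, here assumed to be urls/, and inject them before
the next generate/fetch round:

  bin/nutch inject crawl/crawldb urls/

Either change only takes effect on the next generate run.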


nutchcase wrote:
> Here is the output from that:
> TOTAL urls:   297
> retry 0:      297
> min score:    0.0
> avg score:    0.023377104
> max score:    2.009
> status 2 (db_fetched):        295
> status 5 (db_redir_perm):     2
>
>
> reinhard schwab wrote:
>   
>> try
>>
>> bin/nutch readdb crawl/crawldb -stats
>>
>> are there any unfetched pages?
>>
>> nutchcase wrote:
>>     
>>> My crawl always stops at depth=3. It gets documents but does not continue
>>> any
>>> further.
>>> Here is my nutch-site.xml
>>> <?xml version="1.0"?>
>>> <configuration>
>>> <property>
>>> <name>http.agent.name</name>
>>> <value>nutch-solr-integration</value>
>>> </property>
>>> <property>
>>> <name>generate.max.per.host</name>
>>> <value>1000</value>
>>> </property>
>>> <property>
>>> <name>plugin.includes</name>
>>> <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>> <property>
>>> <name>db.max.outlinks.per.page</name>
>>> <value>1000</value>
>>> </property>
>>> </configuration>
