All the URLs that are queued are crawled; the problem is that the crawl doesn't
look further than depth 3 for URLs, so anything below that depth doesn't end up
in the segments. If I disable URL filtering completely by removing it from
nutch-site.xml, it picks up far too many URLs, so I guess the problem is with my
filter definition. I just can't seem to get the filter right.
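
For what it's worth, this is the accept rule I think I'm aiming for (just a
sketch, with foo.com standing in for the real domain as in the filter quoted
below; the escaped dots and the final catch-all deny are the parts I'm not
sure about):

# allow urls in foo.com domain and its subdomains
+^http://([a-zA-Z0-9-]+\.)*foo\.com/
# deny anything else
-.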

reinhard schwab wrote:
> 
> and you miss some urls to be crawled? which?
> 
> with
> 
> bin/nutch readdb crawl/crawldb -dump <some directory>
> 
> you can dump the content of the crawl db into readable format.
> you will see there the next fetch times of the urls and the status.
> 
> with
> 
> bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>
> 
> you can dump a segment into readable format
> and see which links have been extracted.
> 
> nutchcase schrieb:
>> Right, I have commented that part of the filter out and it gets urls with
>> queries, but only to a depth of 3. Here is my url filter:
>> -^(https|telnet|file|ftp|mailto):
>>
>> # skip some suffixes
>> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> #-[...@=]
>>
>> # allow urls in foofactory.fi domain
>> +^http://([a-z0-9\-A-Z]*\.)*.foo.com/
>>
>> # deny anything else
>> #-.
>>  
>>
>> reinhard schwab wrote:
>>   
>>> the crawler has stopped fetching because all urls are already fetched.
>>> there are no unfetched urls left.
>>> do you expect to have more urls fetched?
>>>
>>> either you need more seed urls or you change your url filters.
>>> the default nutch url filter configuration excludes the deep web, every
>>> url with a query part (?).
>>>
>>>
>>>     
>>
>>   
> 
> 
> 
