and you are missing some urls that should have been crawled? which ones?

with

bin/nutch readdb crawl/crawldb -dump <some directory>

you can dump the content of the crawldb into a readable format.
there you will see the status and the next fetch time of each url.
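
for example, a quick way to check whether any unfetched urls remain
(just a sketch, assuming nutch 1.x; "crawldb_dump" is an example name
and the exact wording of the status lines may differ in your version):

# print status counts directly
bin/nutch readdb crawl/crawldb -stats

# or dump the crawldb and count the status lines yourself
bin/nutch readdb crawl/crawldb -dump crawldb_dump
grep "Status:" crawldb_dump/part-* | sort | uniq -c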

with

bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>

you can dump a segment into a readable format
and see which links have been extracted.
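
for example, to pull just the extracted outlinks out of a segment dump
(again only a sketch; "20090301120000" is a made-up segment name, the
output file is usually called "dump", and the "outlink:" lines come from
the parse data, so the exact format may vary in your version):

bin/nutch readseg -dump crawl/segments/20090301120000 segment_dump
grep "outlink:" segment_dump/dump | sort -u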

nutchcase wrote:
> Right, I have commented that part of the filter out and it gets urls with
> queries, but only to a depth of 3. Here is my url filter:
> -^(https|telnet|file|ftp|mailto):
>
> # skip some suffixes
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[...@=]
>
> # allow urls in foofactory.fi domain
> +^http://([a-z0-9\-A-Z]*\.)*.foo.com/
>
> # deny anything else
> #-.
>  
>
> reinhard schwab wrote:
>   
>> the crawler has stopped fetching because all urls are already fetched.
>> there are no unfetched urls left.
>> do you expect to have more urls fetched?
>>
>> either you need more seed urls or you change your url filters.
>> the default nutch url filter configuration excludes the deep web,
>> i.e. every url with a query part (?).
