Do you mean that some urls you expect to be crawled are missing? Which ones?

With

bin/nutch readdb crawl/crawldb -dump <some directory>

you can dump the content of the crawl db into a readable format; there you will see the next fetch time and the status of each url. With

bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>

you can dump a segment into a readable format and see which links have been extracted.
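To get a quick overview without reading through everything, something like this should work (just a sketch; the crawl/dump output directory and the part-00000 file name are assumptions on my part, and the exact status labels such as db_fetched/db_unfetched can differ between nutch versions):

# print status counts for the crawldb (db_fetched, db_unfetched, ...)
bin/nutch readdb crawl/crawldb -stats

# dump the crawldb to text and count the unfetched entries
bin/nutch readdb crawl/crawldb -dump crawl/dump
grep -c 'db_unfetched' crawl/dump/part-00000

# dump one segment and browse the extracted outlinks
# (the name of the dump file inside the output directory may vary)
bin/nutch readseg -dump crawl/segments/<segment_dir> crawl/segdump
less crawl/segdump/dump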
nutchcase wrote:
> Right, I have commented that part of the filter out and it gets urls with
> queries, but only to a depth of 3. Here is my url filter:
> -^(https|telnet|file|ftp|mailto):
>
> # skip some suffixes
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> # allow urls in foofactory.fi domain
> +^http://([a-z0-9\-A-Z]*\.)*.foo.com/
>
> # deny anything else
> #-.
>
> reinhard schwab wrote:
>
>> the crawler has stopped fetching because all urls are already fetched.
>> there are no unfetched urls left.
>> do you expect to have more urls fetched?
>>
>> either you need more seed urls or you change your url filters.
>> the default nutch url filter configuration excludes the deep web, every
>> url with a query part (?).
>>
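One more note on the allow rule as quoted above: foo.com is obviously a placeholder (the comment mentions foofactory.fi), so this may just be an artifact of anonymizing the rule, but the dots in .foo.com are not escaped and therefore match any character, and the stray dot right before foo.com means that neither http://foo.com/ nor http://www.foo.com/ would match that exact pattern. If the intent is to allow the domain plus all of its subdomains, a tighter rule would be something like

+^http://([a-zA-Z0-9-]+\.)*foo\.com/

with your real domain in place of foo.com.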