All the URLs that are queued get crawled. The problem is that the crawl doesn't look further than depth 3 for URLs, so anything below that depth doesn't end up in the segments. If I disable URL filtering completely by removing it from nutch-site.xml, it gets too many URLs, so I guess the problem is with my filter definition. I just can't seem to get the filter right.
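One way to narrow this down is to test the filter offline against a few URLs that should or should not be crawled. The following is a minimal sketch, not Nutch code: `parse_rules` and `accepts` are hypothetical helper names, the rule text is copied from the filter quoted in this message, and it assumes that the first matching rule decides and that a URL matching no rule is rejected, which is my understanding of how regex-urlfilter behaves. It uses Python's `re` instead of Java regex; the two should agree for patterns this simple.

```python
import re

# Filter rules as quoted in this message (the wrapped "i\ co" is rejoined to "ico").
FILTER_TEXT = r"""
-^(https|telnet|file|ftp|mailto):
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# allow urls in the foo.com domain
+^http://([a-z0-9\-A-Z]*\.)*.foo.com/
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First rule whose pattern is found in the URL decides; no match means reject (assumption)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Rules exactly as posted.
POSTED = parse_rules(FILTER_TEXT)

# Hypothetical variant: drop the stray "." before "foo" and escape the dot in "foo.com".
FIXED = parse_rules(FILTER_TEXT.replace(r")*.foo.com/", r")*foo\.com/"))

if __name__ == "__main__":
    samples = [
        "http://www.foo.com/a/b/c/d/page.html?id=1",
        "http://www.foo.com/logo.png",
        "https://www.foo.com/",
    ]
    for url in samples:
        print(url, "posted:", accepts(url, POSTED), "fixed:", accepts(url, FIXED))
```

In this sketch the `+` pattern as posted does not match `http://www.foo.com/...`, because the unescaped `.` in front of `foo` has to consume one extra character. Since the real domain was replaced with foo.com for posting, this may not apply to the actual filter, but the same check can be run with the real rules and a handful of deep URLs taken from a segment dump.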

reinhard schwab wrote:
>
> and you miss some urls to be crawled? which?
>
> with
>
> bin/nutch readdb crawl/crawldb -dump <some directory>
>
> you can dump the content of the crawl db into readable format.
> you will see there the next fetch times of the urls and the status.
>
> with
>
> bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>
>
> you can dump a segment into readable format
> and see which links have been extracted.
>
> nutchcase wrote:
>> Right, I have commented that part of the filter out and it gets urls with
>> queries, but only to a depth of 3. Here is my url filter:
>> -^(https|telnet|file|ftp|mailto):
>>
>> # skip some suffixes
>> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> #-[...@=]
>>
>> # allow urls in foofactory.fi domain
>> +^http://([a-z0-9\-A-Z]*\.)*.foo.com/
>>
>> # deny anything else
>> #-.
>>
>> reinhard schwab wrote:
>>
>>> the crawler has stopped fetching because all urls are already fetched.
>>> there are no unfetched urls left.
>>> do you expect to have more urls fetched?
>>>
>>> either you need more seed urls or you change your url filters.
>>> the default nutch url filter configuration excludes the deep web, every
>>> url with a query part (?).

--
View this message in context: http://www.nabble.com/crawl-always-stops-at-depth%3D3-tp25981603p26012590.html
Sent from the Nutch - User mailing list archive at Nabble.com.
