What command are you running to crawl the web? If you are using the 'bin/nutch crawl' command, then 'conf/crawl-urlfilter.txt' is used. Is the question mark after http://www.search.com punctuation, or is it part of the URL? If it is part of the URL, the second rule, -[?*!@=], in 'conf/crawl-urlfilter.txt' is filtering it out.

There can be a variety of reasons why your crawl is failing. Please read the 'logs/hadoop.log' file and see if you can find the cause of the error.
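For example, assuming you are using the 0.9-style crawl command and your seed list is in a directory called 'urls' (the directory name and the depth/topN values below are only placeholders), a crawl is typically started with something like:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

If the question mark really is part of the URL and you want such URLs to be fetched, dropping '?' from that character class should let them through, e.g. change the rule to something like:

  -[*!@=]

The actual cause of a failed fetch usually shows up in the log, e.g.:

  grep -i exception logs/hadoop.log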
"Generator: 0 records selected for fetching, exiting ..." - You get this error if your depth value is high but there are no more URLs to fetch. This may happen in your case because the fetch in the first cycle fails. So no new URLs are discovered and as a result there are no URLs to fetch. Another possibility is that the first set of URLs fetched in the first cycle do not point to any other pages that is allowed by 'conf/crawl-urlfilter.txt'. Regards, Susam Pal On Jan 7, 2008 8:56 AM, <[EMAIL PROTECTED]> wrote: > why i can crawl http://game.search.com but i can't crawl > http://www.search.com? conf/crawl-urlfilter is > > # skip file:, ftp:, & mailto: urls > -^(file|ftp|mailto): > > # skip image and other suffixes we can't yet parse > #-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$ > > # skip URLs containing certain characters as probable queries, etc. > [EMAIL PROTECTED] > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops > -.*(/.+?)/.*?\1/.*?\1/ > > # accept hosts in MY.DOMAIN.NAME > #+^http://([a-z0-9]*\.)*search.com/ > > # skip everything else > +. > > and some host i can't crawl because have error "Generator: 0 records selected > for fetching, exiting ..." i set the same config for all host.why? > > >
