Looks like your URL filters are OK. I was able to crawl to depth 2 on slashdot.

Have you turned on logging and looked for more clues in the logs?
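The fetch and generate steps already log to logs/hadoop.log by default; on my install (a stock Nutch 0.9-style layout -- adjust paths and package names if your version differs) I bump the log level in conf/log4j.properties and grep for the host afterwards, roughly like this:

    # conf/log4j.properties -- verbose output for the whole crawler
    log4j.logger.org.apache.nutch=DEBUG

    # after re-running the crawl, look for the host in question
    grep -i slashdot logs/hadoop.log

The generate/fetch messages usually say why a URL was dropped (robots.txt, URL filters, retry limits, etc.).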
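You could also feed the URL straight through the filter chain to rule the filters in or out. I'm going from memory on the checker class and its flag, so double-check they exist in your version before relying on this:

    # reads URLs from stdin; prints each one back prefixed with + (accepted) or - (rejected)
    echo 'http://slashdot.org/' | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If that prints a '+', the filters aren't your problem and the logs should show what the fetcher ran into.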
Howie

> Date: Wed, 25 Jun 2008 23:18:27 +0530
> From: [EMAIL PROTECTED]
> To: nutch-user@lucene.apache.org
> Subject: Re: Crawling SLASHDOT.ORG
>
> Hi Howie,
>
> crawl-urlfilter.txt looks like this:
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> # skip everything else
> +.*
>
> and my regex-urlfilter.txt looks like this:
>
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept anything else
> +.
>
> On Wed, Jun 25, 2008 at 11:15 PM, Howie Wang <[EMAIL PROTECTED]> wrote:
> >
> > What does your crawl-urlfilter.txt or regex-urlfilter.txt look like?
> >
> > Howie
> >
> > > Date: Wed, 25 Jun 2008 23:00:12 +0530
> > > From: [EMAIL PROTECTED]
> > > To: nutch-user@lucene.apache.org
> > > Subject: Crawling SLASHDOT.ORG
> > >
> > > Hi,
> > >
> > > I am new to Nutch. I have been trying to crawl "slashdot.org", but
> > > due to some unknown problem I am unable to crawl the site.
> > > I am able to crawl any other site (bbc, ndtv, cricbuzz, etc.), but
> > > when I try to crawl "slashdot.org" I get the following error:
> > >
> > > "Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=1 - no more URLs to fetch."
> > >
> > > Can someone please help me out?
> > >
> > > Thank you in advance,
> > >
> > > Kranthi Reddy. B