Looks like your URL filters are OK. I was able to crawl to depth 2
on slashdot.

Have you turned on logging and looked for more clues in the logs?
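In a stock install Nutch writes its run-time messages to logs/hadoop.log (configured via conf/log4j.properties, where you can also raise the org.apache.nutch loggers to DEBUG). As a rough sketch, assuming that default log location, something like this pulls out the lines worth reading first:

    import re

    # Rough sketch: scan the Nutch log for clues about the slashdot crawl.
    # Assumes the stock layout where Nutch logs to logs/hadoop.log; adjust
    # the path and the keywords for your install.
    log_path = "logs/hadoop.log"
    keywords = re.compile(r"slashdot\.org|Generator|Fetcher|robots", re.IGNORECASE)

    with open(log_path) as log:
        for line in log:
            if keywords.search(line):
                print(line.rstrip())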

Howie


> Date: Wed, 25 Jun 2008 23:18:27 +0530
> From: [EMAIL PROTECTED]
> To: nutch-user@lucene.apache.org
> Subject: Re: Crawling SLASHDOT.ORG
> 
> Hi Howie,
> 
> crawl-urlfilter.txt looks like this:
> 
>   # The url filter file used by the crawl command.
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
> 
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> 
> # skip everything else
> +.*
> 
> 
> 
> and my regex-urlfilter.txt looks like this:
> 
> # The default url filter.
> # Better for whole-internet crawling.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
> 
> # accept anything else
> +.
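
For what it's worth, a quick way to see which of these rules a given URL actually hits is to replay the first-match-wins logic the comments describe. The sketch below is not Nutch's RegexURLFilter, just an approximation of the rules quoted above, and it assumes the line the list archive mangles into [EMAIL PROTECTED] is the stock query-character rule -[?*!@=]:

    import re

    # First matching pattern decides: '+' keeps the URL, '-' drops it,
    # and a URL that matches no pattern at all is ignored.
    RULES = [
        ("-", r"^(file|ftp|mailto):"),
        ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
              r"|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
        ("-", r"[?*!@=]"),                # assumed stock query-character rule
        ("-", r".*(/.+?)/.*?\1/.*?\1/"),  # slash segment repeated 3+ times
        ("+", r".*"),                     # accept everything else
    ]

    def check(url):
        for sign, pattern in RULES:
            if re.search(pattern, url):
                return sign
        return "-"

    # Example: the front page passes, but a story-style URL with '?' and '='
    # in it gets dropped by the query-character rule.
    for url in ("http://slashdot.org/",
                "http://slashdot.org/article.pl?sid=08/06/25/1238256"):
        print(check(url), url)

(The example story URL is only an illustration of the link shape, not a real page.)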
> 
> 
> On Wed, Jun 25, 2008 at 11:15 PM, Howie Wang <[EMAIL PROTECTED]> wrote:
> 
> >
> > What does your crawl-urlfilter.txt or regex-urlfilter.txt look like?
> >
> > Howie
> >
> >
> > > Date: Wed, 25 Jun 2008 23:00:12 +0530
> > > From: [EMAIL PROTECTED]
> > > To: nutch-user@lucene.apache.org
> > > Subject: Crawling SLASHDOT.ORG
> > >
> > > Hi,
> > >
> > >      I am new to Nutch. I have been trying to crawl "slashdot.org",
> > > but due to some unknown problem I am unable to crawl the site.
> > >      I am able to crawl other sites (bbc, ndtv, cricbuzz, etc.), but
> > > when I try to crawl "slashdot.org" I get the following error:
> > >
> > >         "Generator: jobtracker is 'local', generating exactly one partition.
> > >          Generator: 0 records selected for fetching, exiting ...
> > >           Stopping at depth=1 - no more URLs to fetch."
> > >
> > >     Can someone please help me out?
> > >
> > >
> > > Thank you in advance
> > >
> > >  Kranthi Reddy. B
> >

