In my experience, about 60 sites (out of 10,000 in my "vertical" list!) explicitly banned my crawler via instructions in their robots.txt files.
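Such explicit bans can be detected with Python's stdlib robots.txt parser before fetching. A minimal sketch — the bot name, URLs, and inline robots.txt below are made up, and the file content is parsed from a string so the example runs without network access:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans one specific user-agent entirely.
robots_txt = """\
User-agent: MyVerticalBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Our bot is explicitly banned; an unrelated bot is not.
print(rp.can_fetch("MyVerticalBot", "http://example.com/page.html"))  # False
print(rp.can_fetch("OtherBot", "http://example.com/page.html"))       # True
```

In a real crawler you would fetch each host's /robots.txt with RobotFileParser.set_url() and read(), and cache the parsed result per host.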
After detailed analysis I found that about 50 of those sites were hosted on the same IP address; I had used fetch-per-TLD instead of fetch-per-IP. The remaining sites simply did not want to appear in my search results - that's their right!

-Fuad
Tokenizer

> -----Original Message-----
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: February-03-10 2:50 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: A well-behaved crawler
>
> When you say "banned by several sites", do you mean that you get back
> non-200 responses for pages that you know exist? Or something else?
>
> Also, there's another constraint that many sites impose, which is the
> total number of page fetches per day. Unfortunately, you don't know if
> you've hit this limit until you run into problems. A good rule of thumb is
> no more than 5K requests/day for a major site.
>
> -- Ken
>
> PS - You're not running in EC2 by any chance, are you?
>
> On Feb 3, 2010, at 2:21am, Sjaiful Bahri wrote:
>
> > "A well-behaved crawler needs to follow a set of loosely-defined
> > behaviors to be 'polite' - don't crawl a site too fast, don't crawl
> > any single IP address too fast, don't pull too much bandwidth from
> > small sites by e.g. downloading tons of full-res media that will
> > never be indexed, meticulously obey robots.txt, identify itself with
> > a user-agent string that points to a detailed web page explaining the
> > purpose of the bot, etc."
> >
> > But my crawler is still banned by several sites... :(
> >
> > cheers
> > iful
> >
> > http://zipclue.com
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
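The fetch-per-IP fix described above can be sketched in a few lines: key the politeness queue by resolved IP address rather than by hostname, so that co-hosted sites share a single rate limit instead of each getting their own. The hostnames and IPs below are made up, and DNS resolution is stubbed out with a dict so the sketch runs offline:

```python
import socket
from collections import defaultdict
from urllib.parse import urlparse

def queue_key(url, per_ip=True, resolve=socket.gethostbyname):
    """Return the politeness-queue key for a URL: the resolved IP
    (so co-hosted sites share one delay) or just the hostname.
    `resolve` is injectable so the sketch needs no real DNS."""
    host = urlparse(url).hostname
    return resolve(host) if per_ip else host

# Hypothetical DNS table: two sites share one IP, as in the thread.
dns = {"site-a.example": "10.0.0.1",
       "site-b.example": "10.0.0.1",
       "big-site.example": "10.0.0.2"}

queues = defaultdict(list)
for url in ("http://site-a.example/x",
            "http://site-b.example/y",
            "http://big-site.example/z"):
    queues[queue_key(url, resolve=dns.__getitem__)].append(url)

# Both co-hosted sites land in the same queue -> one shared delay.
print(len(queues["10.0.0.1"]))  # 2
print(len(queues["10.0.0.2"]))  # 1
```

With per-hostname keying, 50 co-hosted sites would get 50 independent queues, and their combined request rate against the one shared server could easily look abusive — which is exactly the failure mode described above.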