In my experience, about 60 sites (out of 10,000 in my "vertical" list!) explicitly banned my crawler via instructions in their robots.txt files.
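Such explicit bans can be detected with Python's stdlib robots.txt parser before fetching. A minimal sketch — the bot name, URLs, and inline robots.txt below are made up, and the file content is parsed from a string so the example runs without network access:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans one specific user-agent entirely.
robots_txt = """\
User-agent: MyVerticalBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Our bot is explicitly banned; an unrelated bot is not.
print(rp.can_fetch("MyVerticalBot", "http://example.com/page.html"))  # False
print(rp.can_fetch("OtherBot", "http://example.com/page.html"))       # True
```

In a real crawler you would fetch each host's /robots.txt with RobotFileParser.set_url() and read(), and cache the parsed result per host.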
After detailed analysis I found that about 50 of those sites were hosted on the same IP address; I had used fetch-per-TLD instead of fetch-per-IP. The remaining sites simply did not want to appear in my search results - that's their right!

-Fuad
Tokenizer

> -----Original Message-----
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: February-03-10 2:50 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: A well-behaved crawler
>
> When you say "banned by several sites", do you mean that you get back
> non-200 responses for pages that you know exist? Or something else?
>
> Also, there's another constraint that many sites impose, which is the
> total number of page fetches per day. Unfortunately, you don't know if
> you've hit this limit until you run into problems. A good rule of thumb is
> no more than 5K requests/day for a major site.
>
> -- Ken
>
> PS - You're not running in EC2 by any chance, are you?
>
> On Feb 3, 2010, at 2:21am, Sjaiful Bahri wrote:
>
> > "A well-behaved crawler needs to follow a set of loosely-defined
> > behaviors to be 'polite' - don't crawl a site too fast, don't crawl
> > any single IP address too fast, don't pull too much bandwidth from
> > small sites by e.g. downloading tons of full-res media that will
> > never be indexed, meticulously obey robots.txt, identify itself with
> > a user-agent string that points to a detailed web page explaining the
> > purpose of the bot, etc."
> >
> > But my crawler is still banned by several sites... :(
> >
> > cheers
> > iful
> >
> > http://zipclue.com
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
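The fetch-per-IP fix described above can be sketched in a few lines: key the politeness queue by resolved IP address rather than by hostname, so that co-hosted sites share a single rate limit instead of each getting their own. The hostnames and IPs below are made up, and DNS resolution is stubbed out with a dict so the sketch runs offline:

```python
import socket
from collections import defaultdict
from urllib.parse import urlparse

def queue_key(url, per_ip=True, resolve=socket.gethostbyname):
    """Return the politeness-queue key for a URL: the resolved IP
    (so co-hosted sites share one delay) or just the hostname.
    `resolve` is injectable so the sketch needs no real DNS."""
    host = urlparse(url).hostname
    return resolve(host) if per_ip else host

# Hypothetical DNS table: two sites share one IP, as in the thread.
dns = {"site-a.example": "10.0.0.1",
       "site-b.example": "10.0.0.1",
       "big-site.example": "10.0.0.2"}

queues = defaultdict(list)
for url in ("http://site-a.example/x",
            "http://site-b.example/y",
            "http://big-site.example/z"):
    queues[queue_key(url, resolve=dns.__getitem__)].append(url)

# Both co-hosted sites land in the same queue -> one shared delay.
print(len(queues["10.0.0.1"]))  # 2
print(len(queues["10.0.0.2"]))  # 1
```

With per-hostname keying, 50 co-hosted sites would get 50 independent queues, and their combined request rate against the one shared server could easily look abusive — which is exactly the failure mode described above.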