When you say "banned by several sites", do you mean that you get back non-200 responses for pages that you know exist? Or something else?

Also, many sites impose another constraint: a cap on the total number of page fetches per day. Unfortunately, you don't know you've hit it until you run into problems. A good rule of thumb is no more than 5K requests/day for a major site.
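
To make that daily cap concrete, here is a minimal sketch of a per-host fetch budget in Python. The class name `DailyFetchBudget`, its `allow` method, and the injectable `clock` parameter are all made up for illustration (this is not from any particular crawler), and the 5000 default is just the rule of thumb above, not a limit any site publishes. It uses a fixed 24-hour window that starts at the first fetch for each host.

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit

SECONDS_PER_DAY = 24 * 60 * 60


class DailyFetchBudget:
    """Caps fetches per host within a fixed 24-hour window (hypothetical helper)."""

    def __init__(self, max_per_day=5000, clock=time.time):
        # max_per_day is an assumed rule-of-thumb value; tune it per site.
        self.max_per_day = max_per_day
        self.clock = clock
        self._window_start = {}          # host -> time the current window began
        self._count = defaultdict(int)   # host -> fetches in the current window

    def allow(self, url):
        """Return True and count the fetch if the host still has budget left."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        start = self._window_start.get(host)
        # Start a fresh window on the first fetch, or once 24 hours have passed.
        if start is None or now - start >= SECONDS_PER_DAY:
            self._window_start[host] = now
            self._count[host] = 0
        if self._count[host] >= self.max_per_day:
            return False
        self._count[host] += 1
        return True
```

The crawler would call `budget.allow(url)` before each fetch and defer the URL to the next day when it returns False. Counting by hostname is a simplification; counting by IP address or by registered domain may match what a site actually enforces more closely.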

-- Ken

PS - You're not running in EC2 by any chance, are you?

On Feb 3, 2010, at 2:21am, Sjaiful Bahri wrote:

"A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite' - don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full res media that will never be indexed, meticulously obey robots.txt, identify itself with user-agent string that points to a detailed web page explaining the purpose of the bot, etc. "

But my crawler is still banned by several sites... :(

cheers
iful


http://zipclue.com





--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



