When you say "banned by several sites", do you mean that you get back
non-200 responses for pages that you know exist? Or something else?
Also, there's another constraint that many sites impose, which is the
total number of page fetches/day. Unfortunately you don't know if
you've hit this until you run into problems. A good rule of thumb is
no more than 5K requests/day for a major site.
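
For what it's worth, here's a rough sketch of how a per-site daily cap could be enforced before each fetch. The DailyBudget class and its parameters are made up for illustration, not taken from any particular crawler, and the 5000 default is just the rule of thumb above, not a limit that any site publishes.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


class DailyBudget:
    """Hypothetical helper: caps the number of fetches per host per UTC day."""

    def __init__(self, max_per_day=5000, clock=time.time):
        self.max_per_day = max_per_day   # assumed cap, per the rule of thumb
        self.clock = clock               # injectable so the logic can be tested
        self._counts = defaultdict(int)  # host -> fetches counted today
        self._day = {}                   # host -> day number of the last reset

    def allow(self, url):
        """Return True and count the fetch if the host still has budget."""
        host = urlparse(url).netloc.lower()
        today = int(self.clock() // 86400)  # whole days since the epoch (UTC)
        if self._day.get(host) != today:
            # First request of a new day for this host: reset its counter.
            self._day[host] = today
            self._counts[host] = 0
        if self._counts[host] >= self.max_per_day:
            return False
        self._counts[host] += 1
        return True
```

You'd call allow(url) before each request and defer the URL to the next day when it returns False.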
-- Ken
PS - You're not running in EC2 by any chance, are you?
On Feb 3, 2010, at 2:21am, Sjaiful Bahri wrote:
"A well-behaved crawler needs to follow a set of loosely-defined
behaviors to be 'polite' - don't crawl a site too fast, don't crawl
any single IP address too fast, don't pull too much bandwidth from
small sites by e.g. downloading tons of full res media that will
never be indexed, meticulously obey robots.txt, identify itself with
a user-agent string that points to a detailed web page explaining the
purpose of the bot, etc. "
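
Two of the behaviors quoted above (obeying robots.txt and not hitting a host too fast) can be sketched with Python's standard urllib.robotparser. This is only a minimal illustration: the USER_AGENT value, the load_rules function, and the HostThrottle class are hypothetical names, and the 1-second default delay is an assumption rather than a recommended value.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent string; the URL should point at a page describing the bot.
USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot.html)"


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class HostThrottle:
    """Hypothetical helper: spaces out requests to the same host."""

    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.clock = clock
        self._next_ok = {}                  # host -> earliest time of next fetch

    def wait_time(self, url, rules):
        """Reserve the next slot for this host; return seconds to sleep first."""
        host = urlparse(url).netloc.lower()
        # Prefer the site's own Crawl-delay if robots.txt declares one.
        delay = rules.crawl_delay(USER_AGENT) or self.default_delay
        now = self.clock()
        ready_at = max(now, self._next_ok.get(host, now))
        self._next_ok[host] = ready_at + delay
        return ready_at - now
```

A fetch loop would skip any URL for which rules.can_fetch(USER_AGENT, url) is False, and otherwise sleep for wait_time(url, rules) seconds before requesting it. Throttling by IP address rather than by hostname would need a DNS lookup on top of this.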
But my crawler is still banned by several sites... :(
cheers
iful
http://zipclue.com
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g