Google webmaster tools 

  https://www.google.com/webmasters/tools/home

lets you (amongst other things) submit sitemaps and see the crawl rate for 
your site (for the previous 90 days). There's also a form to report problems 
with how googlebot is accessing your site:

  https://www.google.com/webmasters/tools/googlebot-report

The crawl rate is modified to try to avoid overloading your site, but given 
that GAE will just fire up more instances, I guess googlebot decides your 
site is built for that kind of traffic and keeps upping the crawl rate. 
You could try to mimic a site being killed by the crawler: keep basic 
stats in memcache every time you get hit by googlebot (as identified by 
the request headers), and if the requests come too thick and fast, delay 
the responses or simply return a 408, or maybe a 503 or 509 (see the 
sketch below). My guess is you'll see the crawl rate back off pretty 
quickly.

  http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
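
Something along these lines might do it -- just a rough sketch for the 
Python runtime; the window size, hit limit and handler name are made-up 
values you'd tune for your own app:

  from google.appengine.api import memcache
  import webapp2

  CRAWL_WINDOW_SECS = 60   # counting window (made-up value, tune it)
  CRAWL_LIMIT = 30         # max googlebot hits served per window (made up)

  class ThrottledHandler(webapp2.RequestHandler):
      def get(self):
          ua = self.request.headers.get('User-Agent', '')
          if 'Googlebot' in ua:
              # bump the per-window counter; incr() returns None if the
              # key doesn't exist yet, so create it with an expiry
              # (small race between incr and add, fine for a sketch)
              hits = memcache.incr('googlebot_hits')
              if hits is None:
                  memcache.add('googlebot_hits', 1, time=CRAWL_WINDOW_SECS)
                  hits = 1
              if hits > CRAWL_LIMIT:
                  # pretend we're overloaded and hint when to come back
                  self.response.set_status(503)
                  self.response.headers['Retry-After'] = '120'
                  return
          self.response.write('normal page content goes here')

  app = webapp2.WSGIApplication([('/.*', ThrottledHandler)])

Matching "Googlebot" in the User-Agent is usually enough to spot it; if 
you want to be strict you can verify the client with a reverse DNS lookup.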

It would be nice if robots.txt or sitemap files let you specify a maximum 
crawl rate (cf. RSS files), or if people agreed on an HTTP status code for 
a "we're close, but not THAT close..." response to tell crawlers to back 
off (418 perhaps :) but I don't expect those standards have moved very 
much recently...

--
T
