I run a website containing lots of doctor-related data.  We get crawled by 
rogue crawlers from thousands of IP addresses DAILY (mostly in Russia) and 
we sometimes see our content show up on other websites.  I define a crawler 
as "rogue" when it does not obey robots.txt exclusions and the company 
behind it offers us no benefit and just sucks up system resources.
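
To make the "does not obey robots.txt" part of that definition concrete, here is a rough sketch of how a logged request could be checked against the exclusions. The rules and paths below are hypothetical placeholders, not our actual robots.txt, and the helper name is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the site's real rules are not shown here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def violates_robots(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if fetching `path` is disallowed for `user_agent`."""
    parser = RobotFileParser()
    # parse() takes the file as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # A crawler that requested a disallowed path would count toward "rogue".
    print(violates_robots("SomeBot/1.0", "/private/records"))
    print(violates_robots("SomeBot/1.0", "/index.html"))
```

Running something like this over the access log would flag crawlers that fetch excluded paths, which is only the first half of the definition; whether a crawler benefits us is a judgment call.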

Google App Engine is hosting a crawler (appid: s~steprep) that is similar 
to the Russian ones we block.  This crawler crawls us aggressively, sucks 
up system resources, ignores the robots.txt file, and offers no benefit to 
us.  Per our normal policy, we have been blocking the dozens of Google IP 
addresses that this crawler is crawling from.  The problem is that one or 
more of these IP addresses also host Google's "PageSpeed Insights" page, 
located here: https://developers.google.com/speed/pagespeed/insights
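
For comparison with IP blocking, here is a minimal sketch of filtering on the appid instead. It assumes the crawler's requests carry the appid in the User-Agent header in a form like "appid: s~steprep"; that header shape, the sample strings, and the function names are assumptions for illustration, not something confirmed by Google. If the assumption holds, a check like this would leave other traffic from the same IP addresses alone.

```python
import re

# Assumed shape of the identifying fragment inside the User-Agent header,
# e.g. "...; appid: s~steprep)". This format is an assumption.
APPID_PATTERN = re.compile(r"appid:\s*([^\s;)]+)")

# App IDs we want to refuse; s~steprep is the one named above.
BLOCKED_APPIDS = {"s~steprep"}


def extract_appid(user_agent):
    """Return the appid found in the User-Agent string, or None."""
    match = APPID_PATTERN.search(user_agent or "")
    return match.group(1) if match else None


def should_block(user_agent):
    """Block only requests whose User-Agent names a blocked appid."""
    return extract_appid(user_agent) in BLOCKED_APPIDS


if __name__ == "__main__":
    # Hypothetical sample headers for illustration only.
    crawler_ua = "ExampleFetcher AppEngine-Google; (+http://code.google.com/appengine; appid: s~steprep)"
    other_ua = "Mozilla/5.0 (compatible; SomeOtherTool/1.0)"
    print(should_block(crawler_ua))
    print(should_block(other_ua))
```

This only helps if the header is present and cannot be changed by the crawler's operator, which is part of what I am asking Google to clarify.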

My questions for Google are: 
1 - Is it your intention that websites be unable to block crawlers that you 
host?
2 - Is it your intention that websites must allow the steprep crawler in 
exchange for using the PageSpeed Insights tool?

Some people may ask, "Why not just ask the company to stop crawling you?"  
Two reasons:
1 - Some companies ignore the request.
2 - Some companies temporarily stop crawling, then show up again a few days 
or weeks later, at which point I have to waste time dealing with it all 
over again.

If we were to allow every crawler to crawl our site, our server would be 
brought to its knees.  Website owners need a mechanism for blocking rogue 
crawlers, even when they are hosted by Google App Engine.

