I run a website containing lots of doctor-related data. We get crawled DAILY by rogue crawlers from thousands of IP addresses (mostly in Russia), and we sometimes see our content show up on other websites. I define a crawler as "rogue" when it does not obey robots.txt exclusions and the company behind it offers no benefit to us and just sucks up system resources.
Google App Engine is hosting a crawler (appid: s~steprep) that is similar to the Russian ones we block. This crawler crawls us aggressively, sucks up system resources, ignores the robots.txt file, and offers no benefit to us. Per our usual policy, we have been blocking the hundreds of Google IP addresses that this crawler is crawling from.

The problem is that one or more of these IP addresses also hosts Google's "PageSpeed Insights" page, located here: https://developers.google.com/speed/pagespeed/insights

My questions for Google are:

1 - Is it your intention that websites be unable to block crawlers that you host?
2 - Is it your intention that websites must allow the steprep crawler in exchange for using the PageSpeed Insights tool?

Some people may suggest "why not just ask the company crawling you to stop crawling you?"

1 - Some companies ignore the request.
2 - Some companies temporarily stop crawling, then show up again a few days or weeks later, at which point I have to waste time dealing with it all over again.

If we were to allow every crawler to crawl our site, our server would be brought to its knees. I'm not going to waste money on increasing server resources just so more crawlers can scrape our data. Website owners need a mechanism for blocking rogue crawlers, even when they are hosted by Google App Engine.
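
For illustration, here is a minimal sketch of the kind of per-app blocking I would like to be able to rely on instead of blocking whole IP ranges. It assumes that requests sent through App Engine's URL Fetch service carry a User-Agent containing "AppEngine-Google; (+http://code.google.com/appengine; appid: <id>)", which is where the s~steprep app ID above comes from. The names `BLOCKED_APP_IDS`, `extract_app_id`, and `should_block` are hypothetical, and a crawler that does not use URL Fetch, or that sends a different header, would not be caught by this check.

```python
import re

# Assumption: App Engine URL Fetch appends a marker of this shape to the
# User-Agent header: "AppEngine-Google; (+http://code.google.com/appengine; appid: <id>)"
APPID_PATTERN = re.compile(r"AppEngine-Google;.*?appid:\s*([^\s;)]+)")

# Hypothetical blocklist; s~steprep is the app ID mentioned in this post.
BLOCKED_APP_IDS = {"s~steprep"}


def extract_app_id(user_agent):
    """Return the App Engine app ID found in a User-Agent string, or None."""
    match = APPID_PATTERN.search(user_agent or "")
    return match.group(1) if match else None


def should_block(user_agent):
    """Return True when the request identifies itself as a blocked App Engine app."""
    return extract_app_id(user_agent) in BLOCKED_APP_IDS
```

A web server or application filter could call `should_block` on each request's User-Agent and return a 403 for matches, which would leave other traffic from the same Google IP addresses untouched.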
