Every fetch request from GAE includes the appid in a request header. You have evidently seen it yourself, since that is how you know the crawler's appid. This is how Google enables you to block individual applications: reject every request that carries that appid in the header, rather than blocking by IP address.
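To illustrate the header check described above, here is a minimal Python sketch. It rests on assumptions of mine, not on anything stated in this thread: that the appid shows up in the User-Agent header in a form like "AppEngine-Google; (+http://code.google.com/appengine; appid: s~steprep)", and that your site can be wrapped in WSGI middleware. The names APPID_RE, BLOCKED_APPIDS, extract_appid, is_blocked, and BlockAppEngineApps are hypothetical; check your own access logs for the exact header and format before relying on it.

```python
import re

# Assumption: the appid appears in the User-Agent as "appid: <id>".
# Verify the real header name and format against your own logs.
APPID_RE = re.compile(r"appid:\s*([^\s;)]+)")

# The appid reported in the original post; add others as needed.
BLOCKED_APPIDS = {"s~steprep"}


def extract_appid(user_agent):
    """Return the appid embedded in a User-Agent string, or None."""
    match = APPID_RE.search(user_agent or "")
    return match.group(1) if match else None


def is_blocked(headers):
    """Return True if the headers identify a blocked App Engine app."""
    appid = extract_appid(headers.get("User-Agent", ""))
    return appid in BLOCKED_APPIDS


class BlockAppEngineApps:
    """WSGI middleware that answers 403 for blocked appids."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        headers = {"User-Agent": environ.get("HTTP_USER_AGENT", "")}
        if is_blocked(headers):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Because the decision is made per request from the header, other services that happen to share the same Google IP addresses are unaffected. The same match could equally be expressed as a rule in your web server configuration instead of application code.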
Jeff

On Wed, Jul 25, 2012 at 9:35 AM, jswap <[email protected]> wrote:
> I run a website containing lots of doctor-related data. We get crawled by
> rogue crawlers from thousands of IP addresses DAILY (mostly in Russia) and
> we sometimes see our content show up on other websites. I define a crawler
> as "rogue" when it does not obey robots.txt exclusions, and the crawling
> company offers no benefit to us and just sucks up system resources.
>
> Google App Engine is hosting a crawler (appid: s~steprep) that is similar to
> the Russian ones we block. This crawler crawls us aggressively, sucks up
> system resources, ignores the robots.txt file, and offers no benefit to us.
> Per our usual policy, we have been blocking the hundreds of Google IP
> addresses that this crawler is crawling from. The problem is that one or
> more of these IP addresses also hosts Google's "PageSpeed Insights" page,
> located here: https://developers.google.com/speed/pagespeed/insights
>
> My questions for Google are:
> 1 - Is it your intention that websites be unable to block crawlers that you
> host?
> 2 - Is it your intention that websites must allow the steprep crawler in
> exchange for using the PageSpeed Insights tool?
>
> Some people may suggest "why not just ask the company crawling you to stop
> crawling you?"
> 1 - Some companies ignore the request.
> 2 - Some companies temporarily stop crawling, then show up again a few days
> or weeks later, at which point I have to waste time dealing with it all over
> again.
>
> If we were to allow every crawler to crawl our site, our server would be
> brought to its knees. I'm not going to waste money on increasing server
> resources just so more crawlers can scrape our data. Website owners need a
> mechanism for blocking rogue crawlers, even when they are hosted by Google
> App Engine.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/Bo8u134CRr8J.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
