Every fetch request from GAE includes the appid in a request header.
You have obviously seen it yourself, since that is how you know the
crawler's appid.  This is how Google enables you to block applications:
just block all requests whose header carries that particular appid.
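
As a rough illustration of that kind of header check, here is a
minimal sketch.  It assumes the appid shows up in the User-Agent
header in a form like "appid: s~steprep"; the function names and the
blocklist are hypothetical, so check your own access logs for the
exact header and format before relying on it.

```python
import re

# Assumption: the appid appears in the header value as "appid: <id>".
APPID_PATTERN = re.compile(r"appid:\s*([A-Za-z0-9~_.\-]+)")

# Hypothetical blocklist; s~steprep is the appid named in this thread.
BLOCKED_APPIDS = {"s~steprep"}


def extract_appid(header_value):
    """Return the appid found in a header value, or None if absent."""
    match = APPID_PATTERN.search(header_value or "")
    return match.group(1) if match else None


def should_block(headers, blocked=BLOCKED_APPIDS):
    """Return True if the User-Agent header carries a blocked appid."""
    appid = extract_appid(headers.get("User-Agent", ""))
    return appid in blocked
```

A web server or framework hook could call should_block() on each
incoming request and answer with a 403 when it returns True, which
avoids blocking by IP address altogether.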

Jeff

On Wed, Jul 25, 2012 at 9:35 AM, jswap <[email protected]> wrote:
> I run a website containing lots of doctor-related data.  We get crawled by
> rogue crawlers from thousands of IP addresses DAILY (mostly in Russia) and
> we sometimes see our content show up on other websites.  I define a crawler
> as "rogue" when it does not obey robots.txt exclusions, and the crawling
> company offers no benefit to us and just sucks up system resources.
>
> Google App Engine is hosting a crawler (appid: s~steprep) that is similar to
> the Russian ones we block.  This crawler crawls us aggressively, sucks up
> system resources, ignores the robots.txt file, and offers no benefit to us.
> Per our usual policy, we have been blocking the hundreds of Google IP
> addresses that this crawler is crawling from.  The problem is that one or
> more of these IP addresses also hosts Google's "PageSpeed Insights" page,
> located here: https://developers.google.com/speed/pagespeed/insights
>
> My questions for Google are:
> 1 - Is it your intention that websites be unable to block crawlers that you
> host?
> 2 - Is it your intention that websites must allow the steprep crawler in
> exchange for using the PageSpeed Insights tool?
>
> Some people may suggest "why not just ask the company crawling you to stop
> crawling you?"
> 1 - Some companies ignore the request.
> 2 - Some companies temporarily stop crawling, then show up again a few days
> or weeks later, at which point I have to waste time dealing with it all over
> again.
>
> If we were to allow every crawler to crawl our site, our server would be
> brought to its knees.  I'm not going to waste money on increasing server
> resources just so more crawlers can scrape our data.  Website owners need a
> mechanism for blocking rogue crawlers, even when they are hosted by Google
> App Engine.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/Bo8u134CRr8J.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
