Re: [google-appengine] Google App Engine, rogue crawlers, and PageSpeed Insights

Kate Thu, 02 Aug 2012 13:08:24 -0700

How can I block the following curl requests? Not every IP is different, and 
I get tens of thousands of them every day.


Honestly I do not know HOW to block them. What method/code?


2012-08-02 15:03:21.103 / 405 55ms 0kb curl/7.18.2 (i386-redhat-linux-gnu) 
libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18

132.72.23.10 - - [02/Aug/2012:13:03:21 -0700] "HEAD / HTTP/1.1" 405 124 - 
"curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 
libidn/0.6.14 libssh2/0.18" "aussieclouds.appspot.com" ms=56 cpu_ms=0 
api_cpu_ms=0 cpm_usd=0.000045 instance=00c61b117c41a67b1b944a189d7cc38d5365564c 
<https://appengine.google.com/instances?app_id=aussieclouds&version_id=1.360754534133043769&key=00c61b117c41a67b1b944a189d7cc38d5365564c#00c61b117c41a67b1b944a189d7cc38d5365564c>
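One possible application-level approach, as a rough sketch only: wrap the WSGI 
app in a small middleware that rejects any request whose User-Agent starts 
with "curl/". This assumes a Python WSGI app. The class name BlockUserAgents 
and the prefix list are made up for illustration and are not part of any App 
Engine API.

```python
# Hypothetical WSGI middleware: the names below are illustrative only.
BLOCKED_AGENT_PREFIXES = ("curl/",)


class BlockUserAgents(object):
    """Reject requests whose User-Agent starts with a blocked prefix."""

    def __init__(self, app, prefixes=BLOCKED_AGENT_PREFIXES):
        self.app = app
        self.prefixes = tuple(prefixes)

    def __call__(self, environ, start_response):
        # WSGI exposes the User-Agent request header as HTTP_USER_AGENT.
        agent = environ.get("HTTP_USER_AGENT", "")
        if agent.startswith(self.prefixes):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Usage would be something like app = BlockUserAgents(app) around the existing 
WSGI application object. The request still reaches an instance, so this only 
keeps the rejection cheap, and a client can get around it by changing its 
User-Agent.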



On Thursday, July 26, 2012 5:27:27 PM UTC-4, Jeff Schnitzer wrote:
>
> Every fetch request from GAE includes the appid as a header... you 
> obviously see it yourself, which is how you know the appid of the 
> crawler.  This is how Google enables you to block applications; just 
> block all requests with that particular header. 
>
> Jeff 
>
> On Wed, Jul 25, 2012 at 9:35 AM, jswap <[email protected]> wrote: 
> > I run a website containing lots of doctor-related data.  We get crawled by 
> > rogue crawlers from thousands of IP addresses DAILY (mostly in Russia) and 
> > we sometimes see our content show up on other websites.  I define a crawler 
> > as "rogue" when it does not obey robots.txt exclusions, and the crawling 
> > company offers no benefit to us and just sucks up system resources. 
> > 
> > Google App Engine is hosting a crawler (appid: s~steprep) that is similar to 
> > the Russian ones we block.  This crawler crawls us aggressively, sucks up 
> > system resources, ignores the robots.txt file, and offers no benefit to us. 
> > Per our usual policy, we have been blocking the hundreds of Google IP 
> > addresses that this crawler is crawling from.  The problem is that one or 
> > more of these IP addresses also hosts Google's "PageSpeed Insights" page, 
> > located here: https://developers.google.com/speed/pagespeed/insights 
> > 
> > My questions for Google are: 
> > 1 - Is it your intention that websites be unable to block crawlers that you 
> > host? 
> > 2 - Is it your intention that websites must allow the steprep crawler in 
> > exchange for using the PageSpeed Insights tool? 
> > 
> > Some people may suggest "why not just ask the company crawling you to stop 
> > crawling you?" 
> > 1 - Some companies ignore the request. 
> > 2 - Some companies temporarily stop crawling, then show up again a few days 
> > or weeks later, at which point I have to waste time dealing with it all over 
> > again. 
> > 
> > If we were to allow every crawler to crawl our site, our server would be 
> > brought to its knees.  I'm not going to waste money on increasing server 
> > resources just so more crawlers can scrape our data.  Website owners need a 
> > mechanism for blocking rogue crawlers, even when they are hosted by Google 
> > App Engine. 
> > 
> > -- 
> > You received this message because you are subscribed to the Google Groups 
> > "Google App Engine" group. 
> > To view this discussion on the web visit 
> > https://groups.google.com/d/msg/google-appengine/-/Bo8u134CRr8J. 
> > To post to this group, send email to [email protected]. 
> > To unsubscribe from this group, send email to 
> > [email protected]. 
> > For more options, visit this group at 
> > http://groups.google.com/group/google-appengine?hl=en. 
>
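As a sketch of the header-based blocking suggested in the quoted reply: the 
code below assumes the appid arrives inside the User-Agent header, in a suffix 
shaped like "AppEngine-Google; (+http://code.google.com/appengine; appid: 
s~steprep)". The thread does not spell out the exact header or format, so the 
pattern is an assumption, and the names APPID_RE, BLOCKED_APPIDS, 
sending_appid and is_blocked are hypothetical.

```python
import re

# Assumed shape of the suffix App Engine adds to outbound fetch requests.
APPID_RE = re.compile(r"AppEngine-Google;.*?appid:\s*([^\s;)]+)")

# App IDs to refuse; s~steprep is the one named in the thread.
BLOCKED_APPIDS = frozenset(["s~steprep"])


def sending_appid(user_agent):
    """Return the appid found in a User-Agent string, or None."""
    match = APPID_RE.search(user_agent or "")
    return match.group(1) if match else None


def is_blocked(user_agent):
    """True when the request claims to come from a blocked App Engine app."""
    return sending_appid(user_agent) in BLOCKED_APPIDS
```

A server would call is_blocked() on the incoming User-Agent value and return 
a 403 when it is true, which avoids blocking whole ranges of Google IP 
addresses.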

