A 405 is already being returned for these requests anyway.

The incoming rate is <1 QPS. Besides filling up your logs, I'm not sure how,
if at all, this is affecting your app.
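
If you still want to reject these requests explicitly, here is a minimal
sketch, assuming a Python app served through WSGI. The names
BLOCKED_AGENT_SUBSTRINGS, is_blocked and BlockAgentsMiddleware are made up
for illustration and are not part of any App Engine API. Note that the
request still reaches your app and still shows up in the logs; it only
changes the response.

```python
# Hypothetical list of User-Agent substrings to reject (lowercase).
BLOCKED_AGENT_SUBSTRINGS = ("curl/",)


def is_blocked(user_agent):
    """Return True if the User-Agent contains any blocked substring."""
    agent = (user_agent or "").lower()
    return any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS)


class BlockAgentsMiddleware(object):
    """WSGI middleware that answers blocked agents with an empty 403."""

    def __init__(self, app):
        self.app = app  # the wrapped WSGI application

    def __call__(self, environ, start_response):
        # WSGI exposes the User-Agent request header as HTTP_USER_AGENT.
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Length", "0")])
            return [b""]
        return self.app(environ, start_response)
```

You would wrap your existing WSGI application object with
BlockAgentsMiddleware(app). The same check could probably cover the appid
case Jeff describes below by adding a lowercase substring such as
"appid: s~steprep", assuming the appid appears in the User-Agent of the
requests you see.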


On Friday, 3 August 2012 06:08:21 UTC+10, Kate wrote:
>
> How can I block the following curl requests? Not every IP is different and 
> I get tens of thousands of them every day.
>
> Honestly I do not know HOW to block them. What method/code?
>
>
> 2012-08-02 15:03:21.103 / 405 55ms 0kb curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18
>
> 132.72.23.10 - - [02/Aug/2012:13:03:21 -0700] "HEAD / HTTP/1.1" 405 124 - "curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18" "aussieclouds.appspot.com" ms=56 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000045 instance=00c61b117c41a67b1b944a189d7cc38d5365564c 
> <https://appengine.google.com/instances?app_id=aussieclouds&version_id=1.360754534133043769&key=00c61b117c41a67b1b944a189d7cc38d5365564c#00c61b117c41a67b1b944a189d7cc38d5365564c>
>
>
>
> On Thursday, July 26, 2012 5:27:27 PM UTC-4, Jeff Schnitzer wrote:
>>
>> Every fetch request from GAE includes the appid as a header... you 
>> obviously see it yourself, which is how you know the appid of the 
>> crawler.  This is how Google enables you to block applications; just 
>> block all requests with that particular header. 
>>
>> Jeff 
>>
>> On Wed, Jul 25, 2012 at 9:35 AM, jswap <[email protected]> wrote: 
>> > I run a website containing lots of doctor-related data.  We get crawled by
>> > rogue crawlers from thousands of IP addresses DAILY (mostly in Russia) and
>> > we sometimes see our content show up on other websites.  I define a crawler
>> > as "rogue" when it does not obey robots.txt exclusions, and the crawling
>> > company offers no benefit to us and just sucks up system resources. 
>> > 
>> > Google App Engine is hosting a crawler (appid: s~steprep) that is similar to
>> > the Russian ones we block.  This crawler crawls us aggressively, sucks up
>> > system resources, ignores the robots.txt file, and offers no benefit to us.
>> > Per our usual policy, we have been blocking the hundreds of Google IP 
>> > addresses that this crawler is crawling from.  The problem is that one or
>> > more of these IP addresses also hosts Google's "PageSpeed Insights" page,
>> > located here: https://developers.google.com/speed/pagespeed/insights 
>> > 
>> > My questions for Google are: 
>> > 1 - Is it your intention that websites be unable to block crawlers that you
>> > host? 
>> > 2 - Is it your intention that websites must allow the steprep crawler in
>> > exchange for using the PageSpeed Insights tool? 
>> > 
>> > Some people may suggest "why not just ask the company crawling you to stop
>> > crawling you?" 
>> > 1 - Some companies ignore the request. 
>> > 2 - Some companies temporarily stop crawling, then show up again a few days
>> > or weeks later, at which point I have to waste time dealing with it all over
>> > again. 
>> > 
>> > If we were to allow every crawler to crawl our site, our server would be
>> > brought to its knees.  I'm not going to waste money on increasing server
>> > resources just so more crawlers can scrape our data.  Website owners need a
>> > mechanism for blocking rogue crawlers, even when they are hosted by Google
>> > App Engine. 
>> > 
>> > -- 
>> > You received this message because you are subscribed to the Google Groups
>> > "Google App Engine" group. 
>> > To view this discussion on the web visit 
>> > https://groups.google.com/d/msg/google-appengine/-/Bo8u134CRr8J. 
>> > To post to this group, send email to [email protected]. 
>> > To unsubscribe from this group, send email to 
>> > [email protected]. 
>> > For more options, visit this group at 
>> > http://groups.google.com/group/google-appengine?hl=en. 
>>
>

