How can I block the following curl requests? Not every IP is different, and I get tens of thousands of them every day.
Honestly I do not know HOW to block them. What method/code?

2012-08-02 15:03:21.103 / 405 55ms 0kb curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18

132.72.23.10 - - [02/Aug/2012:13:03:21 -0700] "HEAD / HTTP/1.1" 405 124 - "curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18" "aussieclouds.appspot.com" ms=56 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000045 instance=00c61b117c41a67b1b944a189d7cc38d5365564c
<https://appengine.google.com/instances?app_id=aussieclouds&version_id=1.360754534133043769&key=00c61b117c41a67b1b944a189d7cc38d5365564c#00c61b117c41a67b1b944a189d7cc38d5365564c>

On Thursday, July 26, 2012 5:27:27 PM UTC-4, Jeff Schnitzer wrote:
>
> Every fetch request from GAE includes the appid as a header... you
> obviously see it yourself, which is how you know the appid of the
> crawler. This is how Google enables you to block applications; just
> block all requests with that particular header.
>
> Jeff
>
> On Wed, Jul 25, 2012 at 9:35 AM, jswap <[email protected]> wrote:
> > I run a website containing lots of doctor-related data. We get crawled
> > by rogue crawlers from thousands of IP addresses DAILY (mostly in
> > Russia) and we sometimes see our content show up on other websites. I
> > define a crawler as "rogue" when it does not obey robots.txt
> > exclusions, and the crawling company offers no benefit to us and just
> > sucks up system resources.
> >
> > Google App Engine is hosting a crawler (appid: s~steprep) that is
> > similar to the Russian ones we block. This crawler crawls us
> > aggressively, sucks up system resources, ignores the robots.txt file,
> > and offers no benefit to us. Per our usual policy, we have been
> > blocking the hundreds of Google IP addresses that this crawler is
> > crawling from.
> > The problem is that one or more of these IP addresses also hosts
> > Google's "PageSpeed Insights" page, located here:
> > https://developers.google.com/speed/pagespeed/insights
> >
> > My questions for Google are:
> > 1 - Is it your intention that websites be unable to block crawlers
> > that you host?
> > 2 - Is it your intention that websites must allow the steprep crawler
> > in exchange for using the PageSpeed Insights tool?
> >
> > Some people may suggest "why not just ask the company crawling you to
> > stop crawling you?"
> > 1 - Some companies ignore the request.
> > 2 - Some companies temporarily stop crawling, then show up again a few
> > days or weeks later, at which point I have to waste time dealing with
> > it all over again.
> >
> > If we were to allow every crawler to crawl our site, our server would
> > be brought to its knees. I'm not going to waste money on increasing
> > server resources just so more crawlers can scrape our data. Website
> > owners need a mechanism for blocking rogue crawlers, even when they
> > are hosted by Google App Engine.
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "Google App Engine" group.
> > To view this discussion on the web visit
> > https://groups.google.com/d/msg/google-appengine/-/Bo8u134CRr8J.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/google-appengine?hl=en.
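
One way to act on both the question and the quoted advice is to reject requests by their User-Agent header before they reach the application. The sketch below is only an illustration, not a confirmed App Engine recipe. It assumes the site is served through a WSGI application, that the unwanted clients really do send a User-Agent starting with `curl/` (as in the log entry above), and that the App Engine app ID shows up in the User-Agent as a fragment like `appid: s~steprep`. The names `BLOCKED_APPIDS`, `is_blocked` and `BlockUserAgents` are hypothetical.

```python
import re

# Assumption: requests fetched by an App Engine app carry a User-Agent
# fragment such as "appid: s~steprep". The thread does not show the exact
# header, so adjust the pattern to whatever appears in your own logs.
BLOCKED_APPIDS = {"s~steprep"}
APPID_RE = re.compile(r"appid:\s*([^\s;)]+)")


def is_blocked(user_agent):
    """Return True for curl clients and for blocked App Engine app IDs."""
    ua = user_agent or ""
    # The log entry in this thread shows a User-Agent beginning "curl/7.18.2".
    if ua.lower().startswith("curl/"):
        return True
    match = APPID_RE.search(ua)
    return bool(match and match.group(1) in BLOCKED_APPIDS)


class BlockUserAgents(object):
    """WSGI middleware (hypothetical name) that answers 403 to blocked clients."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

To use it, wrap the existing WSGI application object, for example `app = BlockUserAgents(app)`. A User-Agent is trivial to forge, so this only stops clients that identify themselves honestly, and the rejected requests still reach the application and still appear in the logs.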
