It’s very easy to get caught up in the trap of having unrealistic mental 
models of how web servers work. If your host is reasonably recent (< 5 years 
old), then even a single modest machine can probably sustain 300,000 requests 
per second for your robots.txt file, because the file will be served from the 
Linux file cache (memory).
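
For example (untested sketch; the path is a placeholder, and it assumes you 
are happy to keep a static copy of robots.txt on the nginx box), you could 
serve it straight off the local filesystem and leave it out of the rate limit 
entirely:

    server {
        # An exact-match location wins over "location /", so robots.txt is
        # served from local disk (and, after the first read, from the Linux
        # page cache) rather than proxied; no limit_req applies to it.
        location = /robots.txt {
            root /usr/share/nginx/html;   # placeholder path
        }

        location / {
            limit_req zone=per_spider_class;
            proxy_pass http://routing_layer_http/;
        }
    }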

> On Aug 6, 2018, at 10:45 PM, Cameron Kerr <cameron.k...@otago.ac.nz> wrote:
> 
> Hi all, I’ve recently deployed a rate-limiting configuration aimed at 
> protecting myself from spiders.
> 
> nginx version: nginx/1.15.1 (RPM from nginx.org)
> 
> I did this based on the excellent Nginx blog post at 
> https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted the 
> documentation for limit_req and limit_req_zone.
> 
> I understand that you can have multiple zones in play, and that the 
> most-restrictive of all matches will apply for any matching request. I want 
> to go the other way, though: I want to exempt /robots.txt from being rate 
> limited when it is requested by spiders.
> 
> To put this in context, here is the gist of the relevant config, which aims 
> to implement a caching (and rate-limiting) layer in front of a much more 
> complex request routing layer (httpd).
> 
> http {
>    map $http_user_agent $user_agent_rate_key {
>        default "";
>        "~our-crawler" "wanted-robot";
>        "~*(bot/|crawler|robot|spider)" "robot";
>        "~ScienceBrowser/Nutch" "robot";
>        "~Arachni/" "robot";
>    }
> 
>    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>    limit_req_status 429;
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location / {
>            proxy_pass http://routing_layer_http/;
>        }
>    }
> }
> 
> 
> 
> Option 1: (working, but has issues)
> 
> Should I instead put the limit_req inside the "location / {}" stanza, and 
> have a separate "location /robots.txt {}" (or some generalised form using a 
> map) without limit_req inside that stanza?
> 
> That would mean that any other configuration inside the location stanzas 
> would get duplicated, which would be a manageability concern; I just want to 
> override the limit_req. (One way of containing the duplication is sketched 
> after the config below.)
> 
>    server {
>        location /robots.txt {
>            proxy_pass http://routing_layer_http/;
>        }
> 
>        location / {
>            limit_req zone=per_spider_class;
>            proxy_pass http://routing_layer_http/;
>        }
>    }
> 
> I've tested this, and it works.
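> 
> One thought for keeping the duplication manageable (untested, and the 
> snippet name is just a placeholder) would be to move the shared proxy 
> settings into a file and include it from both locations:
> 
>    # conf.d/routing_proxy.inc (placeholder), holding the directives shared
>    # by both locations:
>    #     proxy_pass http://routing_layer_http/;
>    #     ... any other proxy_* settings common to both ...
> 
>    server {
>        location /robots.txt {
>            include conf.d/routing_proxy.inc;
>        }
> 
>        location / {
>            limit_req zone=per_spider_class;
>            include conf.d/routing_proxy.inc;
>        }
>    }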
> 
> 
> Option 2: (working, but has issues)
> 
> Should I create a "location /robots.txt {}" stanza that has a limit_req with 
> a high burst, say burst=500? It's not a whitelist, but perhaps something 
> still useful?
>    
> But I still end up with replicated location stanzas... I don't think I like 
> this approach. (A nodelay variant is sketched after the config below.)
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location /robots.txt {
>            limit_req zone=per_spider_class burst=500;
>            proxy_pass https://routing_layer_https/;
>        }
> 
>        location / {
>            proxy_pass https://routing_layer_https/;
>        }
>    }
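> 
> If I did go down this road, I suppose the burst would also want nodelay, so 
> that queued requests aren't delayed to fit the 100r/m rate (untested):
> 
>        location /robots.txt {
>            # nodelay serves the whole burst immediately instead of queueing
>            # it at 100r/m (the 500th queued request would otherwise wait
>            # roughly five minutes).
>            limit_req zone=per_spider_class burst=500 nodelay;
>            proxy_pass https://routing_layer_https/;
>        }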
> 
> 
> Option 3: (does not work)
> 
> Some other way... perhaps I need to create some map that takes the path and 
> produces a $path_exempt variable, and then somehow use that with the 
> $user_agent_rate_key, returning "" when $path_exempt, or $user_agent_rate_key 
> otherwise.
> 
>    map $http_user_agent $user_agent_rate_key {
>        default "";
>        "~otago-crawler" "wanted-robot";
>        "~*(bot/|crawler|robot|spider)" "robot";
>        "~ScienceBrowser/Nutch" "robot";
>        "~Arachni/" "robot";
>    }
> 
>    map $uri $rate_for_spider_exempting {
>        default $user_agent_rate_key;
>        "/robots.txt" "";
>    }
> 
>    #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m 
> rate=100r/m;
> 
> 
> However, this does not work because the second map is not returning 
> $user_agent_rate_key; the effect is that non-robots are affected (and the 
> load-balancer health-probes start getting rate-limited).
> 
> I'm guessing my understanding of how this works is incorrect, or there is a 
> limitation or some sort of implicit ordering issue.
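> 
> An untested variant I've been wondering about would flip the dependency: map 
> $uri to either the real User-Agent or an empty string first, and then 
> classify that, so the limit key stays empty for the exempt URLs 
> ($agent_for_limiting is a made-up name for this sketch):
> 
>    map $uri $agent_for_limiting {
>        default       $http_user_agent;   # normal URLs: classify the real agent
>        "/robots.txt" "";                  # exempt URLs: blank agent, never classed as a robot
>    }
> 
>    map $agent_for_limiting $user_agent_rate_key {
>        default "";
>        "~our-crawler" "wanted-robot";
>        "~*(bot/|crawler|robot|spider)" "robot";
>        "~ScienceBrowser/Nutch" "robot";
>        "~Arachni/" "robot";
>    }
> 
>    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;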
> 
> 
> Option 4: (does not work)
> 
> http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
> 
> I see that there is a variable $limit_rate that can be used, and this would 
> seem to be the cleanest, except that in testing it doesn't seem to work (I 
> still get 429 responses when sending a User-Agent that matches a bot).
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location /robots.txt {
>            set $limit_rate 0;
>        }
> 
>        location / {
>            proxy_pass http://routing_layer_http/;
>        }
>    }
> 
> 
> I'm still fairly new to Nginx, so I'm looking for something that decomposes 
> cleanly into an Nginx configuration. I would quite like to have just one 
> place where I specify the map of URLs I wish to exempt (I imagine there could 
> be others, such as ~/.well-known/something, that could pop up).
> 
> Thank you very much for your time.
> 
> -- 
> Cameron Kerr
> Systems Engineer, Information Technology Services
> University of Otago
> 