Hello!

On Tue, Aug 07, 2018 at 02:45:02AM +0000, Cameron Kerr wrote:
> Hi all, I've recently deployed a rate-limiting configuration aimed
> at protecting myself from spiders.
>
> nginx version: nginx/1.15.1 (RPM from nginx.org)
>
> I did this based on the excellent Nginx blog post at
> https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted
> the documentation for limit_req and limit_req_zone.
>
> I understand that you can have multiple zones in play, and that the
> most restrictive of all matches will apply for any matching request.
> I want to go the other way though. I want to exempt /robots.txt from
> being rate limited for spiders.
>
> To put this in context, here is the gist of the relevant config,
> which aims to implement a caching (and rate-limiting) layer in front
> of a much more complex request routing layer (httpd).
>
> http {
>     map $http_user_agent $user_agent_rate_key {
>         default                          "";
>         "~our-crawler"                   "wanted-robot";
>         "~*(bot/|crawler|robot|spider)"  "robot";
>         "~ScienceBrowser/Nutch"          "robot";
>         "~Arachni/"                      "robot";
>     }
>
>     limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>     limit_req_status 429;
>
>     server {
>         limit_req zone=per_spider_class;
>
>         location / {
>             proxy_pass http://routing_layer_http/;
>         }
>     }
> }
>
> Option 1: (working, but has issues)
>
> Should I instead put the limit_req inside the "location / {}"
> stanza, and have a separate "location /robots.txt {}" (or some
> generalised form using a map) and not have limit_req inside that
> stanza?
>
> That would mean that any other configuration inside the location
> stanzas would get duplicated, which would be a manageability
> concern. I just want to override the limit_req.
>
> server {
>     location /robots.txt {
>         proxy_pass http://routing_layer_http/;
>     }
>
>     location / {
>         limit_req zone=per_spider_class;
>         proxy_pass http://routing_layer_http/;
>     }
> }
>
> I've tested this, and it works.

This is the simplest and most nginx-like approach: provide exact
configurations in particular locations. And this is what I would
recommend using.

[...]

> Option 3: (does not work)
>
> Some other way... perhaps I need to create some map that takes the
> path and produces a $path_exempt variable, and then somehow use that
> with the $user_agent_rate_key, returning "" when $path_exempt, or
> $user_agent_rate_key otherwise.
>
> map $http_user_agent $user_agent_rate_key {
>     default                          "";
>     "~otago-crawler"                 "wanted-robot";
>     "~*(bot/|crawler|robot|spider)"  "robot";
>     "~ScienceBrowser/Nutch"          "robot";
>     "~Arachni/"                      "robot";
> }
>
> map $uri $rate_for_spider_exempting {
>     default        $user_agent_rate_key;
>     "/robots.txt"  "";
> }
>
> #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
> limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
>
> However, this does not work because the second map is not returning
> $user_agent_rate_key; the effect is that non-robots are affected
> (and the load-balancer health-probes start getting rate-limited).
>
> I'm guessing my reasoning of how this works is incorrect, or there
> is a limitation or some sort of implicit ordering issue.

This approach is expected to work fine (assuming you've used
limit_req somewhere), and I've just tested the exact configuration
snippet provided to be sure. If it doesn't work for you, the problem
is likely elsewhere.
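Something along these lines can be used to check the behaviour in
isolation (a sketch only: the listen port, the "return 200" stub, and
the X-Rate-Key debugging header are illustrative additions, not part
of your config):

http {
    map $http_user_agent $user_agent_rate_key {
        default                          "";
        "~*(bot/|crawler|robot|spider)"  "robot";
    }

    map $uri $rate_for_spider_exempting {
        default        $user_agent_rate_key;
        "/robots.txt"  "";
    }

    # requests with an empty key are not accounted at all
    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        listen 8080;

        limit_req zone=per_spider_class;

        # illustrative debugging aid: expose the computed key, so it
        # is easy to see which requests are subject to limiting
        add_header X-Rate-Key "$rate_for_spider_exempting" always;

        location / {
            return 200 "ok\n";
        }
    }
}

With this, a request for /robots.txt with a bot User-Agent should
come back with an empty X-Rate-Key and never be limited, while
repeated bot requests to other URIs should start returning 429.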
> Option 4: (does not work)
>
> http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
>
> I see that there is a variable $limit_rate that can be used, and
> this would seem to be the cleanest, except in testing it doesn't
> seem to work (still gets 429 responses as a User-Agent that is a
> bot).

The limit_rate directive (and the $limit_rate variable) controls
response bandwidth, and it is completely unrelated to the limit_req
module, so it cannot affect the 429 responses you are seeing.
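If what you want is to slow down how fast bots can download responses
(as opposed to rejecting their requests), something like the
following sketch would apply a per-connection bandwidth cap; the 50k
value and the map contents are illustrative only:

map $http_user_agent $bot_bandwidth {
    default                          "0";    # 0 means "no limit"
    "~*(bot/|crawler|robot|spider)"  "50k";  # arbitrary example value
}

server {
    location / {
        # $limit_rate caps the response transmission rate per
        # connection; it does not reject requests
        set $limit_rate $bot_bandwidth;
        proxy_pass http://routing_layer_http/;
    }
}

To actually reject excess requests with 429, limit_req is the right
tool, as in the configurations discussed above.

--
Maxim Dounin
http://mdounin.ru/

_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx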