Hello!

On Sun, May 17, 2026 at 11:21:15AM -0500, Constantine A. Murenin wrote:

> My optimised and heavily-cached OpenGrok-based dev web site has finally
> succumbed to the DDoS from the supposed AI abuse, so I'm reevaluating
> resource usage and the applicable limits.
>
> My objective is to serve everyone as long as there is available capacity.
> Instead of doing a cat and mouse game on blocking any specific identifier,
> like User-Agent or IP or netblock or region, I want to block excessive
> usage IFF the content isn't already cached.
>
> For example, the search results page on my OpenGrok may take 10ms to 50ms
> to generate when the Lucene index is stored on an mfs, but once generated,
> the page is basically "free" to serve, and I want to keep serving the cached
> entry even during what may look like a DDoS attack on my instance.
> Otherwise, a Slashdot-like event could result in the most popular
> combination of identifiers being promptly blocked, and legitimate users
> being denied access, even when the content was actually "free" and wouldn't
> have required any excessive resources to generate and serve.
>
> I've re-looked at http://freenginx.org/r/limit_req and
> http://freenginx.org/r/limit_conn, but I don't see any way to exclude
> cached content from still getting subjected to the limits.
>
> I think the standard route here may be to use an
> http://freenginx.org/r/error_page exception handler, to automatically
> handle the 503 errors thrown by limit_req and limit_conn, and continue
> serving the content if cached, but I'm not quite certain how to integrate
> it with http://freenginx.org/r/proxy_cache. Any suggestions?

I don't think there is a good way to check whether a particular
request is going to be served from the cache or not. An obvious
solution would be to introduce an additional proxy layer after the
cache, and apply the limits there.
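
For illustration, the additional proxy layer could look roughly like
the untested sketch below. All names, addresses, sizes, and rates (the
"grok" cache zone, the "miss" limit zone, the cache path, ports 8081
and 8080) are assumptions, not details of the actual setup. The outer
server caches without any limits; only cache misses reach the inner
server, which is the only place where limit_req is applied.

```nginx
# Assumption: everything below lives in the http {} block.
proxy_cache_path /var/cache/nginx/grok keys_zone=grok:10m
                 max_size=1g inactive=1d;

# The inner layer only sees connections from the outer layer, so the
# limit is keyed on the client address passed in X-Real-IP.
limit_req_zone $http_x_real_ip zone=miss:10m rate=5r/s;

# Outer layer: cache only, no limits. Cache hits never leave it.
server {
    listen 80;

    location / {
        proxy_cache       grok;
        proxy_cache_valid 200 10m;
        proxy_cache_lock  on;   # one request per key populates the cache

        proxy_set_header  X-Real-IP $remote_addr;
        proxy_pass        http://127.0.0.1:8081;
    }
}

# Inner layer: sees only cache misses, so limits apply only to them.
server {
    listen 127.0.0.1:8081;

    location / {
        limit_req  zone=miss burst=10 nodelay;
        proxy_pass http://127.0.0.1:8080;   # the real backend (assumed)
    }
}
```

With this layout a popular, already cached page is served regardless
of how many requests arrive, while requests which actually have to be
generated by the backend are subject to the limit.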

Another possible solution might be to use proxy_cache with
proxy_pass to a backend which always returns an error, and
error_page to handle the errors in a different location with
limits, the same proxy_cache, and proxy_pass to the real backend.

> One option may be to use http://freenginx.org/r/proxy_store instead of
> proxy_cache, but I'm not sure that'll work properly when I'm also caching
> the search result pages, too, to account for the Slashdot-like events
> (they're currently referred to as "When many people access the same link
> simultaneously -- such as when a GitLab link is shared in a chat room"),
> without creating new restrictions on the input for the search query string,
> for example, not to mention having to do manual purges of the cached data
> and missing all the other nice features of the standard proxy_cache.

Using proxy_store with non-trivial URIs might be problematic, as
well as using it for content which might change. Basically, it is
a mechanism to mirror static files which never change. While
using it as a cache is certainly possible, it is going to be a
non-trivial and error-prone solution. An additional proxy layer is
probably much easier.

Also, not directly related to the question, but rather about
AI-scrapers in general:

- For Mercurial repositories on freenginx.org, which effectively
  provide an infinite number of distinct resources, I observe that
  AI-scrapers started to use large botnets with multiple IP
  addresses from different netblocks (millions of unique IP
  addresses identified as abusive AI-scrapers in just a couple of
  days). Limiting them with limit_req / limit_conn with
  traditional IP-based or netblock-based limits becomes
  ineffective.

- Using userid session cookies (http://freenginx.org/r/userid) and
  limiting users without $uid_got seems to be an effective
  last-resort measure: abusive bots don't seem to try to use
  cookies at all.
  It can block legitimate users (if all the limits are already
  consumed by bots), but for legitimate users with real browsers
  it's just a matter of refreshing the page. I initially thought
  I would have to implement some proof-of-work mechanism to stop
  them, similarly to what Anubis does, but trivial cookies seem to
  be quite effective as well.

- Some AI-labyrinth solutions might also be effective here. Since
  AI-scrapers ignore "nofollow" (and that's why they try to scrape
  Mercurial repositories on freenginx.org in the first place),
  they basically can index any infinite resources. This gives an
  opportunity to keep them indexing something really cheap to
  generate rather than real resources, without any negative
  effects on legitimate users or robots.

Hope this helps.

-- 
Maxim Dounin
http://mdounin.ru/
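
P.S. The cookie-based limiting mentioned in the list above could be
sketched roughly as follows. This is an untested illustration and not
necessarily the configuration used on freenginx.org: the cookie name,
the "nocookie" zone, its size and rate, and the backend address are
all assumptions. Requests which arrive without a userid cookie share
one key and hence one limit, while requests with a cookie get an
empty key, which limit_req does not account.

```nginx
# Assumption: everything below lives in the http {} block.
userid         on;
userid_name    uid;
userid_expires 365d;

# $uid_got is empty when the client did not send the userid cookie.
map $uid_got $nocookie {
    ""       nocookie;   # all cookieless requests share this key
    default  "";         # empty key: request is not limited
}

limit_req_zone $nocookie zone=nocookie:1m rate=10r/s;

server {
    listen 80;

    location / {
        limit_req  zone=nocookie burst=20;
        proxy_pass http://127.0.0.1:8080;   # assumed backend address
    }
}
```

A browser which gets rejected on its first request still receives
the cookie with the error response, so a page refresh is enough to
get out of the limited group.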
