Honestly, I wouldn't consider the 'AI vs resources' issue off-topic... granted I have a modest wall of text incoming. :-)

While I have not used the `go-away` package from https://git.gammaspectra.live/git/go-away, I've also seen the effects of AI harvesters on web servers. The resource consumption can sometimes be absolutely immense, whether it's software limits or physical capacity ceilings being hit. Much of that is due to the AI bots completely ignoring or evading rate limiting by spreading their requests across the massive ranges of CIDRs available to them.

With that said, the majority of "good" AI harvesters/agents identify themselves with a distinct user agent, which makes blocking or rate limiting them at the nginx level fairly straightforward. The more 'sneaky' AI harvesters, on the other hand, generally mimic real or near-real-looking user agents. Those tend to operate predominantly out of 'cloud' CIDRs, which makes them fairly easy to block/filter by network instead.

That, in turn, gives us some options to use against AI agents/harvesters:

1) Using something like the 'nginx ultimate bad bot blocker' project located at https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker and configuring it to be exceptionally strict (return 444 or rate limit) against user agents deemed unwanted.
2) Using iptables/nftables on Linux, or an appliance in front of the nginx server, to block/drop swaths of CIDRs belonging to problematic/toxic cloud networks/data centers.
3) Stout rate limiting via the ngx_http_limit_req module (a minimal sketch follows below).

The best approach is to use all three options together.
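For option 3, a minimal sketch of what that can look like (the zone name, zone size, rate, and backend address below are just placeholders of mine -- tune them to your own traffic):

# In the http {} block: key on the client IP, 10 MB of shared memory,
# allow a sustained 5 requests per second per address.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=5r/s;

server {
    listen 80;
    server_name example.com;

    location / {
        # Permit short bursts of 10 requests without delay, then reject
        # the excess with 429 instead of the default 503.
        limit_req zone=per_ip burst=10 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;   # your backend
    }
}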

In the case of the ultimate bad bot blocker, I've added the following to my blacklist-user-agents.conf file (this covers both probing and AI clients; a note on how these values take effect follows the list):

"~*(?:\b)libwww-perl(?:\b)" 3;
"~*(?:\b)wget(?:\b)" 3;
"~*(?:\b)Go\-http\-client(?:\b)" 3;
"~*(?:\b)LieBaoFast(?:\b)" 3;
"~*(?:\b)Mb2345Browser(?:\b)" 3;
"~*(?:\b)MicroMessenger(?:\b)" 3;
"~*(?:\b)zh_CN(?:\b)" 3;
"~*(?:\b)Kinza(?:\b)" 3;
"~*(?:\b)Bytespider(?:\b)" 3; #TikTok Scraper
"~*(?:\b)Baiduspider(?:\b)" 3;
"~*(?:\b)Sogou(?:\b)" 3;
"~*(?:\b)Datanyze(?:\b)" 3;
"~*(?:\b)AspiegelBot(?:\b)" 3;
"~*(?:\b)adscanner(?:\b)" 3;
"~*(?:\b)serpstatbot(?:\b)" 3;
"~*(?:\b)spaziodat(?:\b)" 3;
"~*(?:\b)undefined(?:\b)" 3;
"~*(?:\b)claudebot(?:\b)" 3;
"~*(?:\b)anthropic\-ai(?:\b)" 3;
"~*(?:\b)ccbot(?:\b)" 3;
"~*(?:\b)FacebookBot(?:\b)" 3;
"~*(?:\b)OmigiliBot(?:\b)" 3;
"~*(?:\b)cohere\-ai(?:\b)" 3;
"~*(?:\b)Diffbot(?:\b)" 3;
"~*(?:\b)omgili(?:\b)" 3;
"~*(?:\b)GoogleOther(?:\b)" 3;
"~*(?:\b)Google\-Extended(?:\b)" 3;
"~*(?:\b)ChatGPT-User(?:\b)" 3;
"~*(?:\b)GPTBot(?:\b)" 3;
"~*(?:\b)Amazonbot(?:\b)" 3;
"~*(?:\b)Applebot(?:\b)" 3;
"~*(?:\b)PerplexityBot(?:\b)" 3;
"~*(?:\b)YouBot(?:\b)" 3;

I've probably left a few off this list, but eh... This seems to have stopped the majority of AI scrapers/harvesters (and probers/exploiters) that use such user agents, leaving the remaining stragglers to be blocked at the firewall level.

As for me, the ones being blocked at the firewall level are primarily Chinese-based cloud providers. The worst case I've seen to date occurred earlier this year, when four different cloud data centers were being used (abused?). Ultimately, someone or some company used those cloud services for mass scraping to feed AI harvesting/training. At one point, one of my servers was fielding thousands of requests a second from hundreds of different IP addresses (none of them making only a single request), all using varying user agents and requesting random pages that had previously been scraped (a URL list seems to have been collected up front). It wasn't until I blocked the majority of Alibaba Cloud (AS45102), Huawei Cloud (AS136907), Tencent Cloud (AS132203), and a small amount of OVH (AS16276) that things /mostly/ returned to normal.
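
For what it's worth, if someone would rather keep (or duplicate) that kind of CIDR blocking inside nginx rather than at the firewall, the geo module can do it. A minimal sketch; the prefixes below are documentation placeholders, not the real ranges of the ASNs above, so pull the actual prefixes from whois/your routing data:

geo $blocked_dc {
    default          0;
    203.0.113.0/24   1;   # placeholder prefix (TEST-NET-3), stand-in for a cloud range
    198.51.100.0/24  1;   # placeholder prefix (TEST-NET-2)
}

server {
    listen 80;
    server_name example.com;

    # Close the connection outright for anything in the listed ranges.
    if ($blocked_dc) {
        return 444;
    }
}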

I've never been a fan of the scorched-earth approach of blanket banning/dropping providers in swaths like this. It's legitimately absurd that I've had to resort to it just to get some sanity back and keep resource usage under control.

--Brett


------ Original Message ------
From "Jeffrey Walton" <[email protected]>
To [email protected]
Date 09/27/2025 04:45:25 PM
Subject Re: Using 444

On Sat, Sep 27, 2025 at 2:28 PM Paul <[email protected]> wrote:

 [...]
 Maxim, many thanks.  Currently battling a DDoS including out of control
 "AI". Front end nginx/1.18.0 (Ubuntu) easily handles volume (CPU usage
 rarely above 1%) but proxied apache2 often runs up to 98% across 12
 cores (complex cgi needs 20-40 ms per response.)

 I'm attempting to mitigate.  Your advice appreciated. I've "snipped"
 below for readability:

My apologies if this wanders too off-topic.

A lot of folks are having trouble due to AI Agents scraping their
sites for training data.  It hit the folks at GNU particularly hard.
If AI is so smart, then why does it not clone a project instead of
scraping source code presented as web pages???

You might consider putting a box on the front-end to handle the abuse
from AI agents.  Anubis, go-away and several others are popular.
go-away provides a list of similar projects at
<https://git.gammaspectra.live/git/go-away#other-similar-projects>.
In fact, go-away names Nginx's ngx_http_js_challenge_module as a
mitigation for the problem.

Jeff
