Honestly, I wouldn't consider the 'AI vs resources' issue off-topic... granted I have a modest wall of text incoming. :-)

While I have not used the `go-away` package from https://git.gammaspectra.live/git/go-away, I've also seen the effects of AI harvesters on web servers. The resource consumption can sometimes be absolutely immense, whether it's software limits or physical capacity ceilings being hit. Much of that is due to the AI bots completely ignoring or evading rate limiting by spreading their requests across the massive ranges of CIDRs available to them.

With that said, the majority of "good" AI harvesters/agents identify themselves with a distinct user agent, which makes blocking or rate limiting them at the nginx level fairly straightforward. The more 'sneaky' AI harvesters, on the other hand, generally mimic real or near-real-looking user agents. Those tend to operate predominantly out of 'cloud' CIDRs, which makes them fairly easy to block/filter by network instead.

That, in turn, gives us some options to use against AI agents/harvesters:

1) Using something like the 'nginx ultimate bad bot blocker' project located at https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker and configuring it to be exceptionally strict (return 444 or rate limit) against user agents deemed unwanted.
2) Using iptables/nftables on Linux, or an appliance in front of the nginx server, to block/drop swaths of CIDRs belonging to problematic/toxic cloud networks/data centers.
3) Stout rate limiting via the ngx_http_limit_req module (a minimal sketch follows below).

The best approach is to use all three options together.
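For option 3, a minimal sketch of what that can look like (the zone name, zone size, rate, and backend address below are just placeholders of mine -- tune them to your own traffic):

# In the http {} block: key on the client IP, 10 MB of shared memory,
# allow a sustained 5 requests per second per address.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=5r/s;

server {
    listen 80;
    server_name example.com;

    location / {
        # Permit short bursts of 10 requests without delay, then reject
        # the excess with 429 instead of the default 503.
        limit_req zone=per_ip burst=10 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;   # your backend
    }
}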

In the case of the ultimate bad bot blocker, I've added the following to my blacklist-user-agents.conf file (this covers both probing and AI clients; a note on how these values take effect follows the list):

"~*(?:\b)libwww-perl(?:\b)" 3;
"~*(?:\b)wget(?:\b)" 3;
"~*(?:\b)Go\-http\-client(?:\b)" 3;
"~*(?:\b)LieBaoFast(?:\b)" 3;
"~*(?:\b)Mb2345Browser(?:\b)" 3;
"~*(?:\b)MicroMessenger(?:\b)" 3;
"~*(?:\b)zh_CN(?:\b)" 3;
"~*(?:\b)Kinza(?:\b)" 3;
"~*(?:\b)Bytespider(?:\b)" 3; #TikTok Scraper
"~*(?:\b)Baiduspider(?:\b)" 3;
"~*(?:\b)Sogou(?:\b)" 3;
"~*(?:\b)Datanyze(?:\b)" 3;
"~*(?:\b)AspiegelBot(?:\b)" 3;
"~*(?:\b)adscanner(?:\b)" 3;
"~*(?:\b)serpstatbot(?:\b)" 3;
"~*(?:\b)spaziodat(?:\b)" 3;
"~*(?:\b)undefined(?:\b)" 3;
"~*(?:\b)claudebot(?:\b)" 3;
"~*(?:\b)anthropic\-ai(?:\b)" 3;
"~*(?:\b)ccbot(?:\b)" 3;
"~*(?:\b)FacebookBot(?:\b)" 3;
"~*(?:\b)OmigiliBot(?:\b)" 3;
"~*(?:\b)cohere\-ai(?:\b)" 3;
"~*(?:\b)Diffbot(?:\b)" 3;
"~*(?:\b)omgili(?:\b)" 3;
"~*(?:\b)GoogleOther(?:\b)" 3;
"~*(?:\b)Google\-Extended(?:\b)" 3;
"~*(?:\b)ChatGPT-User(?:\b)" 3;
"~*(?:\b)GPTBot(?:\b)" 3;
"~*(?:\b)Amazonbot(?:\b)" 3;
"~*(?:\b)Applebot(?:\b)" 3;
"~*(?:\b)PerplexityBot(?:\b)" 3;
"~*(?:\b)YouBot(?:\b)" 3;

I've probably left a few off this list, but eh... This seems to have stopped the majority of AI scrapers/harvesters (and probers/exploiters) that use such user agents, leaving the remaining stragglers to be blocked at the firewall level.

As for me, the ones being blocked at the firewall level are primarily Chinese-based cloud providers. The worst case I've seen to date occurred earlier this year, when four different cloud data centers were being used (abused?). Ultimately, someone or some company used those cloud services for mass scraping to feed AI harvesting/training. At one point, one of my servers was fielding thousands of requests a second from hundreds of different IP addresses (none of them making only a single request), all using varying user agents and requesting random pages that had previously been scraped (a URL list seems to have been collected up front). It wasn't until I blocked the majority of Alibaba Cloud (AS45102), Huawei Cloud (AS136907), Tencent Cloud (AS132203), and a small amount of OVH (AS16276) that things /mostly/ returned to normal.
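
For what it's worth, if someone would rather keep (or duplicate) that kind of CIDR blocking inside nginx rather than at the firewall, the geo module can do it. A minimal sketch; the prefixes below are documentation placeholders, not the real ranges of the ASNs above, so pull the actual prefixes from whois/your routing data:

geo $blocked_dc {
    default          0;
    203.0.113.0/24   1;   # placeholder prefix (TEST-NET-3), stand-in for a cloud range
    198.51.100.0/24  1;   # placeholder prefix (TEST-NET-2)
}

server {
    listen 80;
    server_name example.com;

    # Close the connection outright for anything in the listed ranges.
    if ($blocked_dc) {
        return 444;
    }
}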

I've never been a fan of the scorched-earth approach of blanket banning/dropping providers in swaths like this. It's legitimately absurd that I've had to resort to it just to get some sanity back and keep resource usage under control.

--Brett


------ Original Message ------
From "Jeffrey Walton" <[email protected]>
To [email protected]
Date 09/27/2025 04:45:25 PM
Subject Re: Using 444

On Sat, Sep 27, 2025 at 2:28 PM Paul <[email protected]> wrote:

 [...]
 Maxim, many thanks.  Currently battling a DDoS including out of control
 "AI". Front end nginx/1.18.0 (Ubuntu) easily handles volume (CPU usage
 rarely above 1%) but proxied apache2 often runs up to 98% across 12
 cores (complex cgi needs 20-40 ms per response.)

 I'm attempting to mitigate.  Your advice appreciated. I've "snipped"
 below for readability:

My apologies if this wanders too off-topic.

A lot of folks are having trouble due to AI Agents scraping their
sites for training data.  It hit the folks at GNU particularly hard.
If AI is so smart, then why does it not clone a project instead of
scraping source code presented as web pages???

You might consider putting a box on the front-end to handle the abuse
from AI agents.  Anubis, go-away and several others are popular.
go-away provides a list of similar projects at
<https://git.gammaspectra.live/git/go-away#other-similar-projects>.
In fact, go-away names Nginx's ngx_http_js_challenge_module as a
mitigation for the problem.

Jeff
