On 2025-04-22 14:06, Jonathan Wakely wrote:
> On Tue, 22 Apr 2025 at 13:36, Guinevere Larsen via Gcc <g...@gcc.gnu.org> wrote:
> >
> > On 4/21/25 12:59 PM, Mark Wielaard wrote:
> > > Hi hackers,
> > >
> > > TLDR; When using https://patchwork.sourceware.org or Bunsen
> > > https://builder.sourceware.org/testruns/ you might now have to enable
> > > javascript. This should not impact any scripts, just browsers (or bots
> > > pretending to be browsers). If it does cause trouble, please let us
> > > know. If this works out we might also "protect" bugzilla, gitweb,
> > > cgit, and the wikis this way.
> > >
> > > We don't like to have to do this, but as some of you might have
> > > noticed, Sourceware has been fighting the new AI scraper bots since
> > > the start of the year. We are not alone in this.
> > >
> > > https://lwn.net/Articles/1008897/
> > > https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
> > >
> > > We have tried to isolate services more and block various IP blocks
> > > that were abusing the servers. But that has helped only so much.
> > > Unfortunately the scraper bots are using lots of IP addresses
> > > (probably by installing "free" VPN services that use normal user
> > > connections as exit points) and pretending to be common
> > > browsers/agents. We seem to have to make access to some services
> > > depend on solving a javascript challenge.
> >
> > Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
> > scrapers might be doing this:
> > https://social.wildeboer.net/@jwildeboer/114360486804175788 (this is the
> > last post in the thread; because it was hard to actually follow the
> > thread given the number of replies, please go all the way up and read
> > all 8 posts).
> >
> > Essentially, there's a library developer that pays app developers to
> > just "include this library and a few more lines in your TOS". The
> > library then allows the app to sell the end user's bandwidth to clients
> > of the library developer, letting them make requests through it. This
> > is how big companies are managing to have so many IP addresses, so many
> > of them residential, and it also means that by blocking those IP
> > addresses we will - necessarily - be blocking real user traffic to our
> > platforms.
>
> It seems to me that blocking real users *who are running these shady
> apps* is perfectly reasonable.
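[Mark's mention of a "javascript challenge" above doesn't spell out the
mechanism. One common form of such a challenge is a proof-of-work puzzle:
the server issues a random nonce and the browser-side script must find a
counter whose hash has enough leading zero bits before it gets a session
cookie. Negligible cost for one human pageview, expensive at scraper
volume. Below is a minimal sketch, modeled entirely in Python for
illustration; the nonce format, difficulty, and function names are
assumptions, not Sourceware's actual setup.

import hashlib
import os

DIFFICULTY_BITS = 16  # illustrative; real deployments tune this

def issue_challenge() -> str:
    # Server side: hand the browser a fresh random nonce.
    return os.urandom(8).hex()

def leading_zero_bits(digest: bytes) -> int:
    # Count leading zero bits of a hash digest.
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(nonce: str) -> int:
    # What the browser-side script would do, modeled here in Python:
    # brute-force a counter until the hash clears the difficulty bar.
    counter = 0
    while True:
        digest = hashlib.sha256(f"{nonce}{counter}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return counter
        counter += 1

def verify(nonce: str, counter: int) -> bool:
    # Server side: verification is a single cheap hash.
    digest = hashlib.sha256(f"{nonce}{counter}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

nonce = issue_challenge()
answer = solve(nonce)  # ~2**16 hashes on average at 16 bits
assert verify(nonce, answer)
print(f"solved nonce {nonce} with counter {answer}")

The asymmetry is the point: solving costs the client tens of thousands of
hashes per page, while verifying costs the server exactly one.]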
How do you detect them? From my experience at other hosting places, those
IPs just make a few requests per hour or per day, with a standard
User-Agent. As such it's difficult to differentiate them from normal users.
The problem is that you suddenly have hundreds of thousands of requests per
hour coming from an only slightly smaller number of IPs. And in the middle
you also have legit users using IPs from the same net block.

-- 
Aurelien Jarno
GPG: 4096R/1DDD8C9B
aurel...@aurel32.net
http://aurel32.net
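[To make the arithmetic of Aurelien's point concrete, here is a minimal
sketch of why per-IP rate limits never fire against this traffic even while
the aggregate load is crushing. The log format ("ip path" per line), the
thresholds, and the function names are illustrative assumptions, not
anyone's actual tooling.

from collections import Counter

PER_IP_HOURLY_LIMIT = 10       # assumed rate a normal user stays under
SITE_HOURLY_BASELINE = 50_000  # assumed normal sitewide requests per hour

def analyze_hour(log_lines):
    # log_lines: iterable of "ip path" entries covering one hour.
    hits = Counter(line.split()[0] for line in log_lines)
    over_limit = [ip for ip, n in hits.items() if n > PER_IP_HOURLY_LIMIT]
    total = sum(hits.values())
    print(f"{total} requests from {len(hits)} IPs; "
          f"{len(over_limit)} IPs exceed the per-IP limit")
    if total > SITE_HOURLY_BASELINE and not over_limit:
        print("aggregate anomaly with no per-IP offender: "
              "distributed scraping, indistinguishable per IP")

# 200,000 requests/hour spread over 100,000 IPs: only 2 hits each,
# so no single IP ever trips the per-IP limit.
fake_log = [f"10.{i >> 16 & 255}.{i >> 8 & 255}.{i & 255} /cgit/repo"
            for i in range(100_000) for _ in range(2)]
analyze_hour(fake_log)

Each address stays an order of magnitude under the per-IP limit, so the
only detectable signal is the sitewide aggregate, and acting on that signal
necessarily sweeps up the legit users sharing those net blocks.]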