On 2025-04-22 14:06, Jonathan Wakely wrote:
> On Tue, 22 Apr 2025 at 13:36, Guinevere Larsen via Gcc <g...@gcc.gnu.org> wrote:
> >
> > On 4/21/25 12:59 PM, Mark Wielaard wrote:
> > > Hi hackers,
> > >
> > > TLDR; When using https://patchwork.sourceware.org or Bunsen
> > > https://builder.sourceware.org/testruns/ you might now have to enable
> > > javascript. This should not impact any scripts, just browsers (or bots
> > > pretending to be browsers). If it does cause trouble, please let us
> > > know. If this works out we might also "protect" bugzilla, gitweb,
> > > cgit, and the wikis this way.
> > >
> > > We don't like to have to do this, but as some of you might have
> > > noticed, Sourceware has been fighting the new AI scraper bots since
> > > the start of the year. We are not alone in this.
> > >
> > > https://lwn.net/Articles/1008897/
> > > https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
> > >
> > > We have tried to isolate services more and block various IP blocks
> > > that were abusing the servers. But that has helped only so much.
> > > Unfortunately the scraper bots are using lots of IP addresses
> > > (probably by installing "free" VPN services that use normal user
> > > connections as exit points) and pretending to be common
> > > browsers/agents. We seem to have to make access to some services
> > > depend on solving a javascript challenge.
> >
> > Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
> > scrapers might be doing this:
> > https://social.wildeboer.net/@jwildeboer/114360486804175788 (this is the
> > last post in the thread; because it was hard to actually follow the
> > thread given the number of replies, please go all the way up and read
> > all 8 posts).
> >
> > Essentially, there's a library developer that pays app developers to
> > just "include this library and a few more lines in your TOS". The
> > library then allows the app to sell the end user's bandwidth to clients
> > of the library developer, letting them make requests through it. This
> > is how big companies are managing to have so many IP addresses, so many
> > of them residential, and it also means that by blocking those IP
> > addresses we will - necessarily - be blocking real user traffic to our
> > platforms.
>
> It seems to me that blocking real users *who are running these shady
> apps* is perfectly reasonable.
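[Mark's mention of a "javascript challenge" above doesn't spell out the
mechanism. One common form of such a challenge is a proof-of-work puzzle:
the server issues a random nonce and the browser-side script must find a
counter whose hash has enough leading zero bits before it gets a session
cookie. Negligible cost for one human pageview, expensive at scraper
volume. Below is a minimal sketch, modeled entirely in Python for
illustration; the nonce format, difficulty, and function names are
assumptions, not Sourceware's actual setup.

import hashlib
import os

DIFFICULTY_BITS = 16  # illustrative; real deployments tune this

def issue_challenge() -> str:
    # Server side: hand the browser a fresh random nonce.
    return os.urandom(8).hex()

def leading_zero_bits(digest: bytes) -> int:
    # Count leading zero bits of a hash digest.
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(nonce: str) -> int:
    # What the browser-side script would do, modeled here in Python:
    # brute-force a counter until the hash clears the difficulty bar.
    counter = 0
    while True:
        digest = hashlib.sha256(f"{nonce}{counter}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return counter
        counter += 1

def verify(nonce: str, counter: int) -> bool:
    # Server side: verification is a single cheap hash.
    digest = hashlib.sha256(f"{nonce}{counter}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

nonce = issue_challenge()
answer = solve(nonce)  # ~2**16 hashes on average at 16 bits
assert verify(nonce, answer)
print(f"solved nonce {nonce} with counter {answer}")

The asymmetry is the point: solving costs the client tens of thousands of
hashes per page, while verifying costs the server exactly one.]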
How do you detect them? From my experience at other hosting places, those
IPs just make a few requests per hour or per day, with a standard
User-Agent. As such it's difficult to differentiate them from normal users.
The problem is that you suddenly have hundreds of thousands of requests per
hour coming from an only slightly smaller number of IPs. And in the middle
you also have legit users using IPs from the same net block.

-- 
Aurelien Jarno
GPG: 4096R/1DDD8C9B
aurel...@aurel32.net
http://aurel32.net
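[To make the arithmetic of Aurelien's point concrete, here is a minimal
sketch of why per-IP rate limits never fire against this traffic even while
the aggregate load is crushing. The log format ("ip path" per line), the
thresholds, and the function names are illustrative assumptions, not
anyone's actual tooling.

from collections import Counter

PER_IP_HOURLY_LIMIT = 10       # assumed rate a normal user stays under
SITE_HOURLY_BASELINE = 50_000  # assumed normal sitewide requests per hour

def analyze_hour(log_lines):
    # log_lines: iterable of "ip path" entries covering one hour.
    hits = Counter(line.split()[0] for line in log_lines)
    over_limit = [ip for ip, n in hits.items() if n > PER_IP_HOURLY_LIMIT]
    total = sum(hits.values())
    print(f"{total} requests from {len(hits)} IPs; "
          f"{len(over_limit)} IPs exceed the per-IP limit")
    if total > SITE_HOURLY_BASELINE and not over_limit:
        print("aggregate anomaly with no per-IP offender: "
              "distributed scraping, indistinguishable per IP")

# 200,000 requests/hour spread over 100,000 IPs: only 2 hits each,
# so no single IP ever trips the per-IP limit.
fake_log = [f"10.{i >> 16 & 255}.{i >> 8 & 255}.{i & 255} /cgit/repo"
            for i in range(100_000) for _ in range(2)]
analyze_hour(fake_log)

Each address stays an order of magnitude under the per-IP limit, so the
only detectable signal is the sitewide aggregate, and acting on that signal
necessarily sweeps up the legit users sharing those net blocks.]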