You can probably give up trying to look for IPs that send large numbers of requests. What I'm seeing more and more are requests from these jerks or their peers (https://brightdata.com/ai/agent-browser), who have "residential proxies," i.e. the browser extensions mentioned in the story Josh posted. They send literally a single HTTP request from an IP (usually on a US telecom provider's network, so you can't reasonably block it), and then the next request comes in from a different IP.
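For anyone still tempted by rate limiting, the classic per-IP throttle looks something like this minimal nginx sketch (the zone name and rates are placeholders, not recommendations), and it's exactly the control this traffic is built to slip under, since no single address ever exceeds the rate:

    # Minimal per-IP rate-limit sketch; zone name and rates are placeholders.
    limit_req_zone $binary_remote_addr zone=opac:10m rate=10r/s;

    server {
        location /eg/opac/ {
            # One request per residential-proxy IP means this never fires.
            limit_req zone=opac burst=20 nodelay;
            # ...normal proxying to Evergreen goes here...
        }
    }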
The patch in the bug Mike posted helps significantly, and unless users trade a lot of direct links to search results they shouldn't even be able to detect it. I'm looking into Anubis because we can put it in front of things more easily than baking countermeasures into everything we host. Being completely self-contained (i.e. it doesn't contact a remote server unless you want to use a GeoIP / AS-number blocking service), I prefer it to Cloudflare, especially since Cloudflare's "good" bot blocking isn't affordable for libraries. (I think the free level basically just keeps things that use a "real" bot UA from connecting to your system; if you want to block anything like a residential proxy, you have to pay.)

Some thoughts on UA blocking, since it's come up a little: don't forget you can do things like block anything claiming to be Chrome < 100 on Windows or macOS, and apply a different cutoff on Linux. On Windows and Macs, Chrome will go so far as to tell you "ok look, it's been too long, I'm restarting and then we'll go to whatever page," so very old versions on those OSes are extremely unlikely. Linux can be a concern, though, in case you have libraries with very old OPACs or similar. Also be sure to block things like Windows 95 / 98 (but again, maybe some libraries have Win 7 OPACs :( ), old versions of Firefox, and anything claiming to be IE. Things actually that old likely can't even complete a TLS handshake anymore, now that some of the root certs have been rotated. A lot of proxies are using randomly-constructed UAs to make it harder to bulk-block them.
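To make that concrete, here's a rough nginx sketch of the kind of rules I mean. Every version cutoff here is a placeholder to tune for your own users, not a recommendation:

    # Rough sketch only; all cutoffs are placeholders.
    map $http_user_agent $block_ua {
        default 0;

        # Ancient platforms and browsers: nothing legitimate still runs
        # these against a modern TLS stack.
        "~Windows (95|98)"          1;
        "~(MSIE |Trident/)"         1;   # anything claiming to be IE
        "~Firefox/[1-4]?[0-9]\."    1;   # Firefox < 50

        # Chrome force-updates itself on Windows/macOS, so pre-100
        # versions there are almost certainly fabricated; use a looser
        # cutoff (or none) on Linux if you support old OPAC kiosks.
        "~(Windows NT|Macintosh).*Chrome/[1-9]?[0-9]\." 1;
    }

    server {
        if ($block_ua) { return 403; }
    }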
Jason

--
Jason Boyer
Senior System Administrator
Equinox Open Library Initiative
jbo...@equinoxoli.org
+1 (877) Open-ILS (673-6457)
https://equinoxOLI.org/

On Thu, Jul 10, 2025 at 12:08 PM Mike Rylander via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
> Some things to consider, inline below...
>
> On Thu, Jul 10, 2025 at 11:25 AM John Merriam via Evergreen-dev
> <evergreen-dev@list.evergreen-ils.org> wrote:
> >
> > Hello.
> >
> > This will block Chrome older than 110 (over 2 years old) in Nginx:
> >
> > if ($http_user_agent ~* "(Chrome/10[0-9]\.|Chrome/[0-9][0-9]\.|Chrome/[0-9]\.)") {
> >     return 403;
> > }
> >
> > which put a stop to it for now for us.
>
> Please be careful. In addition to patrons with old browsers (there are plenty out there, unfortunately), there are some black-box kiosks out in the wild, used for selfcheck and in-building OPAC machines, that run an older Chrome (and are not free to upgrade).
>
> > Changing user agents is trivial, though, so finding other blockable patterns, such as in URLs, would be good. I didn't find a good pattern in the URLs yet, but I was only able to look at that quickly. I plan on circling back around to that at some point.
> >
> > I don't think blocking by IP will work against what seems to be a distributed AI botnet. A few months ago we had our data center partners block all non-US IPs. That worked for a few months, but even that doesn't work anymore. We see AI bot traffic coming from US residential IP ranges. A gigantic question I have is: how are they appearing to come from residential IPs, and how could that be stopped?
> >
> > We plan to profile Evergreen looking for slow code that could maybe be improved, but that will be a big project.
>
> I invite more eyes, of course, but "big project" is a bit of an understatement. ;)
>
> Please be careful when testing something that seems "slow" in isolation -- making code X 10% faster will often make seemingly-unrelated code Y 90% slower.
>
> > We also plan to hook a WAF with machine learning into Nginx and see what that can do. Another big project.
> >
> > We may also put captcha on more parts of the OPAC. We have someone working on that.
>
> Have you looked at https://bugs.launchpad.net/evergreen/+bug/2113979? With some refinement of the URL space where the not-a-bot cookie is required, this is shaping up to be a good first-order bot killer.
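> For anyone who hasn't read the bug yet, the general shape of that approach, sketched here in nginx terms purely for illustration (the actual patch lives inside Evergreen, and the cookie name and challenge path below are invented), is to refuse the expensive URL space to clients that never ran the JavaScript that sets a cookie:
>
>     # Illustration only, not the patch; cookie name and paths invented.
>     map $cookie_notabot $missing_cookie {
>         default 0;
>         ""      1;   # cookie absent
>     }
>
>     server {
>         location /eg/opac/mylist/ {
>             # Stateless scrapers never execute the JS that sets the
>             # cookie, so they never get past this check.
>             if ($missing_cookie) { return 302 /eg/opac/challenge; }
>         }
>     }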
> > I can allocate more resources to the OPAC, but that seems like letting them win, and they will probably eventually exhaust that as well.
> >
> > Anubis is a nuclear option I would like to avoid.
>
> I'm curious why you see this as a nuclear option. Granted, most AI scrapers right now (at least, AFAICT) seem to be essentially stateless, so it may be overkill compared to the LP bug linked above, but it's fairly straightforward to set up and maintain. The only drawback right now is that you have to use just one instance, which could become a bottleneck in a very "wide" EG setup.
>
> > Also don't want to turn to something like Cloudflare.
>
> It's certainly not cost effective for the Library space...
>
> > Please do share any findings and I will as well.
> >
> > Thanks
> >
> > On 7/10/2025 10:53 AM, Josh Stompro via Evergreen-dev wrote:
> >
> > One piece of this puzzle that I would like to understand better is how the bad actors are targeting our sites with thousands to hundreds of thousands of unique IP endpoints each day. And I just saw this article come out about how 1 million browsers have installed extensions that turn the user's browser into a scraper.
> >
> > https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
> >
> > Josh
> >
> > On Thu, Feb 13, 2025 at 3:49 PM Shula Link via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
> >>
> >> It's not just Evergreen sites. I had to block all traffic from Hong Kong to our system website after we had a greater-than-10x increase in visitors overnight. I tried doing it by IP, but they just changed, so it ended up being easier to just block everything.
> >>
> >> Shula Link (she/her)
> >> Systems Services Librarian
> >> Greater Clarks Hill Regional Library
> >> sl...@columbiacountyga.gov | sl...@gchrl.org
> >> 706-447-6702
> >>
> >> On Thu, Feb 13, 2025 at 4:46 PM Blake Graham-Henderson via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
> >>>
> >>> All,
> >>>
> >>> I almost replied with the arstechnica article that Josh linked when the thread was started, but I decided not to put it out there until I had set up a test system to see if I could get that code working. A tarpit, I think, serves them right. And, of course, the whole issue is destined to receive the fate of spam and spam filters, forever and ever.
> >>>
> >>> It was a serendipitously timed article. Its existence at this moment signals to me that this isn't a "just us" problem. It's the entire planet.
> >>>
> >>> -Blake-
> >>> Conducting Magic
> >>> Will consume any data format
> >>> MOBIUS
> >>>
> >>> On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
> >>>
> >>> Jeff, thanks for bringing this up on the list.
> >>>
> >>> We are seeing a lot of requests like "GET /eg/opac/mylist/delete?anchor=record_184821&record=184821" from never-seen-before IPs, and they make 1-12 requests and then stop.
> >>>
> >>> And they usually seem to have a random out-of-date Chrome version in the user agent string:
> >>> Chrome/88.0.4324.192
> >>> Chrome/86.0.4240.75
> >>>
> >>> I've been trying to slow down the bots by collecting logs, grabbing all the obvious patterns, and blocking netblocks for non-US ranges. ipinfo.io offers a free country & ASN database download that I've been using to look up the ranges and countries (https://ipinfo.io/products/free-ip-database). I would be happy to share a link to our current blocklist, which has 10K non-US ranges.
> >>>
> >>> I've also been reporting the non-US bot activity to https://www.abuseipdb.com/ just to bring some visibility to these bad bots. I noticed initially that many of the IPs we were getting hit from didn't seem to be listed on any blocklists already, so I figured some reporting might help. I'm kind of curious whether Evergreen sites are getting hit from the same IPs, so an Evergreen-specific blocklist would be useful. If you look up your bot IPs on abuseipdb.com, you can see if I've already reported any of them.
> >>>
> >>> I've also been making use of blocklists from https://iplists.firehol.org/, such as
> >>> https://iplists.firehol.org/files/cleantalk_30d.ipset
> >>> https://iplists.firehol.org/files/botscout_7d.ipset
> >>> https://iplists.firehol.org/files/firehol_abusers_1d.netset
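> >>> For the nginx users on the list, a generated blocklist like that can be loaded with the stock geo module. A minimal sketch (the include path and CIDR contents are invented placeholders you'd generate from the ipinfo.io data or the ipsets above):
> >>>
> >>>     # Minimal sketch; the path and the file's CIDRs are placeholders.
> >>>     geo $bad_netblock {
> >>>         default 0;
> >>>         # blocklist.conf holds lines like "203.0.113.0/24 1;"
> >>>         include /etc/nginx/blocklist.conf;
> >>>     }
> >>>
> >>>     server {
> >>>         if ($bad_netblock) { return 403; }
> >>>     }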
> >>> We are using HAProxy, so I did some looking into the CrowdSec HAProxy bouncer (https://docs.crowdsec.net/u/bouncers/haproxy/), but I'm not sure that would help, since these IPs don't seem to be on blocklists. Then again, I may just not quite understand how CrowdSec is supposed to work.
> >>>
> >>> HAProxy Enterprise has a reCAPTCHA module that I think would allow us to feed any non-US connections that haven't connected before through a reCAPTCHA, but the price of HAProxy Enterprise is out of our budget. https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
> >>>
> >>> There is also a fairly up-to-date project for adding captchas through HAProxy at https://github.com/ndbiaw/haproxy-protection. This looks promising: it requires new connections to perform a JavaScript proof-of-work calculation before being allowed access, which could be a good transparent way of handling it.
> >>>
> >>> We were taken out by ChatGPT bots back in December; their netblocks were a bit easier to block since they were not as spread out. I recently saw this article about how some people are fighting back against bots that ignore robots.txt: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
> >>>
> >>> Josh
> >>>
> >>> On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> Our Evergreen environment has been experiencing a higher-than-usual volume of unwanted bot traffic in recent months. Much of this traffic looks like webcrawlers hitting Evergreen-specific URLs from an enormous number of different IP addresses. Judging from discussion in IRC last week, it sounds like other EG admins have been seeing the same thing. Does anyone have any recommendations for managing this traffic and mitigating its impact?
> >>>>
> >>>> Some solutions that have been suggested/implemented so far:
> >>>> - Geoblocking entire countries.
> >>>> - Using Cloudflare's proxy service. There's some trickiness in getting this to work with Evergreen.
> >>>> - Putting certain OPAC pages behind a captcha.
> >>>> - Deploying publicly-available blocklists of "bad bot" IPs/useragents/etc. (good but limited, and not EG-specific).
> >>>> - Teaching EG to identify and deal with bot traffic itself (but arguably this should happen before the traffic hits Evergreen).
> >>>>
> >>>> My organization is currently evaluating CrowdSec as another possible solution. Any opinions on any of these approaches?
> >>>> --
> >>>> Jeff Davis
> >>>> BC Libraries Cooperative
> >
> > --
> > John Merriam
> > Director of Information Technology
> > Bibliomation, Inc.
> > 24 Wooster Ave.
> > Waterbury, CT 06708
> > 203-577-4070
_______________________________________________
Evergreen-dev mailing list -- evergreen-dev@list.evergreen-ils.org
To unsubscribe send an email to evergreen-dev-le...@list.evergreen-ils.org