Hi Mike.
I hear you on the old Chrome user agents. For now, that is what we're
going with, since it works. So far, no complaints.
Thank you very much for the pointer to LP 2113979. We will look into that.
Regarding Anubis, I'm not a fan at the moment unless there is no other
way. I could be wrong, but I think it will break accessibility (would it
work with the software blind patrons use?). I also have a problem with
wasting all that energy. It's bad enough that AI data centers are on
track to destroy the planet; I don't want to have to waste tons of
energy fighting them.
On 7/10/2025 12:07 PM, Mike Rylander wrote:
Some things to consider, inline below...
On Thu, Jul 10, 2025 at 11:25 AM John Merriam via Evergreen-dev
<evergreen-dev@list.evergreen-ils.org> wrote:
Hello.
This will block Chrome older than 110 (over 2 years old) in Nginx:
# Return 403 for Chrome major versions 1-109 in the user agent.
if ($http_user_agent ~* "Chrome/(10[0-9]|[0-9][0-9]|[0-9])\.") {
    return 403;
}
which has put a stop to it for us, for now.
Please be careful. In addition to patrons with old browsers (there
are plenty out there, unfortunately) there are some black-box kiosks
out in the wild that are used for selfcheck and in-building OPAC
machines which use an older Chrome (and are not free to upgrade).
Changing user agents is trivial, though, so finding other blockable patterns,
such as in URLs, would be good. I haven't found a good pattern in the URLs yet,
but I was only able to look at that quickly. I plan on circling back to it at
some point.
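For hunting URL patterns, one lightweight approach is to tally request paths straight out of the access log, collapsing numeric IDs so variants group together. A minimal sketch in Python, assuming Nginx's default "combined" log format (the regex and the `{N}` placeholder are illustrative choices, not part of any Evergreen tooling):

```python
import re
from collections import Counter

# Matches the quoted request line in nginx's default "combined" log format.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) ([^ ]+) HTTP/[^"]*"')

def top_paths(log_lines, n=10):
    """Count request paths, collapsing runs of digits to {N} so that
    /record/184821 and /record/99 group together."""
    counts = Counter()
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if m:
            counts[re.sub(r"\d+", "{N}", m.group(1))] += 1
    return counts.most_common(n)
```

Feeding it the live access log (e.g. `top_paths(open('/var/log/nginx/access.log'))`) and eyeballing the top collapsed paths may surface endpoints the bots favor.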
I don't think blocking by IP will work against what seems to be a distributed
AI botnet. A few months ago we had our data center partners block all non-US
IPs. That worked for a few months, but even that doesn't work anymore. We see
AI bot traffic coming from US residential IP ranges. A gigantic question I
have is how they are appearing to come from residential IPs, and how that
could be stopped.
We plan to profile Evergreen looking for slow code that could maybe be improved
but that will be a big project.
I invite more eyes, of course, but "big project" is a bit of an
understatement. ;)
Please be careful when testing something that seems "slow" in
isolation -- making code X 10% faster will often make
seemingly-unrelated code Y 90% slower.
We also plan to hook a WAF with machine learning into Nginx and see what that
can do. Another big project.
We may also put captcha on more parts of the OPAC. We have someone working on
that.
Have you looked at https://bugs.launchpad.net/evergreen/+bug/2113979?
With some refinement of the URL space where the not-a-bot cookie is
required, this is shaping up to be a good first-order bot killer.
I can allocate more resources to the OPAC but that seems like letting them win
and they will probably eventually exhaust that as well.
Anubis is a nuclear option I would like to avoid.
I'm curious why you see this as a nuclear option. Granted, most AI
scrapers right now (at least, AFAICT) seem to be essentially
stateless, so it may be overkill compared to the LP bug linked above,
but it's fairly straightforward to set up and maintain. The only
drawback right now is that you have to use just one instance, which
could become a bottleneck in a very "wide" EG setup.
I also don't want to turn to something like Cloudflare.
It's certainly not cost effective for the Library space...
Please do share any findings and I will as well.
Thanks
On 7/10/2025 10:53 AM, Josh Stompro via Evergreen-dev wrote:
One piece of this puzzle that I would like to understand better is how the bad
actors are targeting our sites with thousands to hundreds of thousands of
unique IP endpoints each day. And I just saw this article come out about how 1
million browsers have installed extensions that turn the user's browser into
scrapers.
https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
Josh
On Thu, Feb 13, 2025 at 3:49 PM Shula Link via Evergreen-dev
<evergreen-dev@list.evergreen-ils.org> wrote:
It's not just Evergreen sites. I had to block all traffic from Hong Kong to our
system website after we had a greater than 10x increase in visitors overnight.
I tried doing it by IP, but the IPs just changed, so it ended up being easier
to block everything.
Shula Link (she/her)
Systems Services Librarian
Greater Clarks Hill Regional Library
sl...@columbiacountyga.gov | sl...@gchrl.org
706-447-6702
On Thu, Feb 13, 2025 at 4:46 PM Blake Graham-Henderson via Evergreen-dev
<evergreen-dev@list.evergreen-ils.org> wrote:
All,
I almost replied with the arstechnica article that Josh linked when the thread
was started, but I decided not to put it out there until I had set up a test
system to see if I could get that code working. A tarpit, I think, serves them
right. And, of course, the whole issue is destined to receive the fate of spam
and spam filters, forever and ever.
It was a serendipitously timed article. Its existence at this moment signals to me
that this isn't a "just us" problem; it's the entire planet.
-Blake-
Conducting Magic
Will consume any data format
MOBIUS
On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
Jeff, thanks for bringing this up on the list.
We are seeing a lot of requests like
"GET /eg/opac/mylist/delete?anchor=record_184821&record=184821" from
never-before-seen IPs, and they make 1-12 requests and then stop.
They usually seem to have a random, out-of-date Chrome version in the
user agent string:
Chrome/88.0.4324.192
Chrome/86.0.4240.75
I've been trying to slow down the bots by collecting logs, grabbing all the
obvious patterns, and blocking netblocks for non-US ranges. ipinfo.io offers a free
country & ASN database download that I've been using to look up the ranges and
countries (https://ipinfo.io/products/free-ip-database). I would be happy to share
a link to our current blocklist, which has 10K non-US ranges.
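As a sketch of how a country database like that can feed a blocklist: a CSV of IP ranges with a country code can be turned into Nginx deny directives. The `start_ip`/`end_ip`/`country` column names below are assumptions about the file layout, so check them against the actual download:

```python
import csv
import io
import ipaddress

def non_us_deny_rules(csv_text):
    """Emit nginx `deny` directives for every non-US range in a CSV with
    (assumed) start_ip, end_ip, and country columns."""
    rules = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["country"] == "US":
            continue
        start = ipaddress.ip_address(row["start_ip"])
        end = ipaddress.ip_address(row["end_ip"])
        # One start/end pair may need several CIDR blocks to cover exactly.
        for net in ipaddress.summarize_address_range(start, end):
            rules.append(f"deny {net};")
    return rules
```

The output can be dropped into a file and pulled in with an Nginx `include`, alongside a final `allow all;`.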
I've also been reporting the non-US bot activity to https://www.abuseipdb.com/
just to bring some visibility to these bad bots. I noticed initially that many
of the IPs we were getting hit from didn't seem to be listed on any
blocklists already, so I figured some reporting might help. I'm kind of
curious whether Evergreen sites are getting hit from the same IPs, so an
Evergreen-specific blocklist would be useful. If you look up your bot IPs on
abuseipdb.com you can see if I've already reported any of them.
I've also been making use of block lists from https://iplists.firehol.org/
Such as
https://iplists.firehol.org/files/cleantalk_30d.ipset
https://iplists.firehol.org/files/botscout_7d.ipset
https://iplists.firehol.org/files/firehol_abusers_1d.netset
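Those .netset/.ipset files are essentially one IP or CIDR per line with # comment lines, so converting one into an Nginx deny list takes only a few lines; a minimal sketch:

```python
def netset_to_deny(netset_text):
    """Turn a FireHOL .netset/.ipset file (one IP or CIDR per line,
    '#' comments) into nginx deny directives."""
    rules = []
    for line in netset_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append(f"deny {line};")
    return rules
```

A cron job could refresh the downloaded list, regenerate the file, and reload Nginx.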
We are using HAProxy so I did some looking into the CrowdSec HAProxy Bouncer
(https://docs.crowdsec.net/u/bouncers/haproxy/) but I'm not sure that would
help since these IPs don't seem to be on blocklists. But I may just not quite
understand how CrowdSec is supposed to work.
HAProxy Enterprise has a reCAPTCHA module that I think would allow us to feed
any non-US connections that haven't connected before through a reCAPTCHA, but
the price for HAProxy Enterprise is out of our budget:
https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
There is also a fairly up-to-date project for adding captchas through HAProxy at
https://github.com/ndbiaw/haproxy-protection. This looks promising: it requires
new connections to perform a JavaScript proof-of-work calculation before
allowing access, which could be a good transparent way of handling it.
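For anyone curious what that JavaScript proof-of-work amounts to, the idea is hashcash-style: the server hands out a challenge, the client brute-forces a nonce until the hash clears a difficulty target, and the server verifies the answer cheaply. A toy sketch (the function names and the leading-zero-hex-digit difficulty scheme are illustrative, not how haproxy-protection actually implements it):

```python
import hashlib

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: does sha256(challenge:nonce) start with
    `difficulty` zero hex digits?"""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

def solve(challenge: str, difficulty: int) -> int:
    """The expensive work the client's JavaScript performs: brute-force
    the smallest nonce that passes verify()."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

Each extra hex digit of difficulty multiplies the client's expected work by 16 while the server's check stays a single hash, which is what makes the scheme cheap to police and costly to farm.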
We were taken out by ChatGPT bots back in December; their netblocks were a bit
easier to block since they were not as spread out. I recently saw this
article about how some people are fighting back against bots that ignore
robots.txt:
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
Josh
On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via Evergreen-dev
<evergreen-dev@list.evergreen-ils.org> wrote:
Hi folks,
Our Evergreen environment has been experiencing a higher-than-usual volume of
unwanted bot traffic in recent months. Much of this traffic looks like
webcrawlers hitting Evergreen-specific URLs from an enormous number of
different IP addresses. Judging from discussion in IRC last week, it sounds
like other EG admins have been seeing the same thing. Does anyone have any
recommendations for managing this traffic and mitigating its impact?
Some solutions that have been suggested/implemented so far:
- Geoblocking entire countries.
- Using Cloudflare's proxy service. There's some trickiness in getting this to
work with Evergreen.
- Putting certain OPAC pages behind a captcha.
- Deploying publicly-available blocklists of "bad bot" IPs/useragents/etc.
(good but limited, and not EG-specific).
- Teaching EG to identify and deal with bot traffic itself (but arguably this
should happen before the traffic hits Evergreen).
My organization is currently evaluating CrowdSec as another possible solution.
Any opinions on any of these approaches?
--
Jeff Davis
BC Libraries Cooperative
_______________________________________________
Evergreen-dev mailing list
Evergreen-dev@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-dev
--
John Merriam
Director of Information Technology
Bibliomation, Inc.
24 Wooster Ave.
Waterbury, CT 06708
203-577-4070