If you need an Aspen site to be a test site for Anubis, you can put LARL
down on the list.  I believe you can whitelist IPs with Anubis so our
branches and catalog stations can skip the checks.
Josh

On Thu, Jul 10, 2025 at 11:58 AM Jason Boyer via Evergreen-dev <
evergreen-dev@list.evergreen-ils.org> wrote:

> You can probably give up trying to look for IPs that send large numbers of
> requests; what I'm seeing more and more are requests from these jerks or
> their peers: https://brightdata.com/ai/agent-browser who have
> "residential proxies," i.e. the browser extensions mentioned in the story
> Josh posted. They send literally a single HTTP request from an IP (usually
> on a US telecom provider's network, so you can't reasonably block it) and
> then the next request comes in from a different IP.
>
> The patch in the bug Mike posted helps significantly, and unless users
> trade a lot of direct links to search results they shouldn't even be able
> to detect it.
>
> I'm looking into Anubis because we can put it in front of things more
> easily than baking countermeasures into everything we host. Because it's
> completely self-contained (i.e. it doesn't contact a remote server unless
> you want to use a GeoIP / AS-number blocking service), I prefer it to
> Cloudflare, especially since their "good" bot blocking isn't affordable for
> libraries. (I think the free tier basically just blocks things that use a
> "real" bot UA from connecting to your system; if you want to block anything
> like a residential proxy, you have to pay.)
>
> Some thoughts on UA blocking since it's come up a little: don't forget you
> can do things like block anything claiming to be Chrome < 100 on Windows or
> macOS, and use a different cutoff for Linux. On Windows and Macs, Chrome
> will go so far as to tell you "ok look, it's been too long, I'm restarting
> and then we'll go to whatever page," so very old versions on those OSes are
> extremely unlikely. Linux can be a concern though, in case you have
> libraries with very old OPAC machines or similar. Also be sure to block
> things like Windows 95 / 98 (but again, maybe some libraries have Windows 7
> OPACs :( ), old versions of Firefox, and anything claiming to be IE.
> Things actually that old likely can't even complete an SSL handshake
> anymore after some of the root certs have been rotated. A lot of proxies
> are using randomly-constructed UAs to make it harder to bulk-block them.
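>
> As a rough Nginx sketch of the above (untested as written; the version
> cutoffs are just examples, so adjust them for your own population of real
> users):
>
> # Chrome below 100 claiming to run on Windows or macOS is almost certainly
> # fake; handle Linux separately with a lower cutoff if you have old kiosks.
> if ($http_user_agent ~ "(Windows NT|Macintosh).*Chrome/([0-9]|[0-9][0-9])\.") {
>     return 403;
> }
>
> # Nothing legitimate still runs Windows 95/98 or identifies as MSIE.
> if ($http_user_agent ~* "(Windows 95|Windows 98|MSIE )") {
>     return 403;
> }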
>
> Jason
>
> --
> Jason Boyer
> Senior System Administrator
> Equinox Open Library Initiative
> jbo...@equinoxoli.org
> +1 (877) Open-ILS (673-6457)
> https://equinoxOLI.org/
>
>
> On Thu, Jul 10, 2025 at 12:08 PM Mike Rylander via Evergreen-dev <
> evergreen-dev@list.evergreen-ils.org> wrote:
>
>> Some things to consider, inline below...
>>
>> On Thu, Jul 10, 2025 at 11:25 AM John Merriam via Evergreen-dev
>> <evergreen-dev@list.evergreen-ils.org> wrote:
>> >
>> > Hello.
>> >
>> > This will block Chrome older than 110 (over 2 years old) in Nginx:
>> >
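>> > # Matches Chrome major versions 0-109, i.e. anything claiming Chrome < 110: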
>> > if ($http_user_agent ~* "(Chrome/10[0-9]\.|Chrome/[0-9][0-9]\.|Chrome/[0-9]\.)") {
>> >     return 403;
>> > }
>> >
>> > which has put a stop to it for us, for now.
>> >
>>
>> Please be careful.  In addition to patrons with old browsers (there
>> are plenty out there, unfortunately) there are some black-box kiosks
>> out in the wild that are used for selfcheck and in-building OPAC
>> machines which use an older Chrome (and are not free to upgrade).
>>
>> > Changing user agents is trivial, though, so finding other blockable
>> patterns, such as in the URLs, would be good.  I haven't found a good
>> pattern in the URLs yet, but I was only able to look at that quickly.  I
>> plan on circling back around to that at some point.
>> >
>> > I don't think blocking by IP will work against what seems to be a
>> distributed AI botnet.  A few months ago we had our data center partners
>> block all non-US IPs.  That worked for a few months but even that doesn't
>> work anymore.  We see AI bot traffic coming from US residential IP ranges.
>> A gigantic question I have is how are they appearing to come from
>> residential IPs and how could that be stopped?
>> >
>> > We plan to profile Evergreen looking for slow code that could maybe be
>> improved but that will be a big project.
>> >
>>
>> I invite more eyes, of course, but "big project" is a bit of an
>> understatement. ;)
>>
>> Please be careful when testing something that seems "slow" in
>> isolation -- making code X 10% faster will often make
>> seemingly-unrelated code Y 90% slower.
>>
>> > We also plan to hook a WAF with machine learning into Nginx and see
>> what that can do.  Another big project.
>> >
>> > We may also put captcha on more parts of the OPAC.  We have someone
>> working on that.
>> >
>>
>> Have you looked at https://bugs.launchpad.net/evergreen/+bug/2113979?
>> With some refinement of the URL space where the not-a-bot cookie is
>> required, this is shaping up to be a good first-order bot killer.
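>>
>> For illustration only (this is not what the patch itself does, and the
>> cookie name below is made up), the general shape of the idea at the proxy
>> layer is just "no cookie, no search":
>>
>> # Hypothetical Nginx sketch: require a not-a-bot cookie on search URLs.
>> location /eg/opac/results {
>>     # Assume an "eg_not_a_bot" cookie gets set elsewhere (e.g. by a small
>>     # bit of JavaScript on first load); requests without it are turned away.
>>     if ($cookie_eg_not_a_bot = "") {
>>         return 403;
>>     }
>>     # ... normal OPAC handling continues here ...
>> }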
>>
>> > I can allocate more resources to the OPAC but that seems like letting
>> them win and they will probably eventually exhaust that as well.
>> >
>> > Anubis is a nuclear option I would like to avoid.
>> >
>>
>> I'm curious why you see this as a nuclear option.  Granted, most AI
>> scrapers right now (at least, AFAICT) seem to be essentially
>> stateless, so it may be overkill compared to the LP bug linked above,
>> but it's fairly straightforward to set up and maintain.  The only
>> drawback right now is that you have to use just one instance, which
>> could become a bottleneck in a very "wide" EG setup.
>>
>> > Also don't want to turn to something like Cloudflare.
>> >
>>
>> It's certainly not cost effective for the Library space...
>>
>> > Please do share any findings and I will as well.
>> >
>> > Thanks
>> >
>> >
>> > On 7/10/2025 10:53 AM, Josh Stompro via Evergreen-dev wrote:
>> >
>> > One piece of this puzzle that I would like to understand better is how
>> the bad actors are targeting our sites with thousands to hundreds of
>> thousands of unique IP endpoints each day.  And I just saw this article
>> come out about how nearly 1 million browsers have installed extensions
>> that turn users' browsers into scrapers.
>> >
>> >
>> https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
>> >
>> > Josh
>> >
>> >
>> > On Thu, Feb 13, 2025 at 3:49 PM Shula Link via Evergreen-dev <
>> evergreen-dev@list.evergreen-ils.org> wrote:
>> >>
>> >> It's not just Evergreen sites. I had to block all traffic from Hong
>> Kong to our system website after we had a greater than 10x increase in
>> visitors overnight. I tried blocking by IP, but the IPs just changed, so it
>> ended up being easier to just block everything.
>> >>
>> >> Shula Link (she/her)
>> >> Systems Services Librarian
>> >> Greater Clarks Hill Regional Library
>> >> sl...@columbiacountyga.gov | sl...@gchrl.org
>> >> 706-447-6702
>> >>
>> >>
>> >> On Thu, Feb 13, 2025 at 4:46 PM Blake Graham-Henderson via
>> Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>> >>>
>> >>> All,
>> >>>
>> >>> I almost replied with the arstechnica article that Josh linked when
>> the thread was started. But I decided not to put it out there until I had
>> set up a test system to see if I could get that code working. A tarpit, I
>> think, serves them right. And, of course, the whole issue is destined to
>> receive the fate of spam and spam filters forever and ever.
>> >>>
>> >>> It was a serendipitously timed article. Its existence at this moment
>> in time signals to me that this isn't a "just us" problem. It's the entire
>> planet.
>> >>>
>> >>> -Blake-
>> >>> Conducting Magic
>> >>> Will consume any data format
>> >>> MOBIUS
>> >>>
>> >>> On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
>> >>>
>> >>> Jeff, thanks for bringing this up on the list.
>> >>>
>> >>> We are seeing a lot of requests like
>> >>>  "GET /eg/opac/mylist/delete?anchor=record_184821&record=184821" from
>> never-before-seen IPs, and they make 1-12 requests and then stop.
>> >>>
>> >>> And they usually seem to have a random out-of-date Chrome version in
>> the user agent string.
>> >>> Chrome/88.0.4324.192
>> >>> Chrome/86.0.4240.75
>> >>>
>> >>> I've been trying to slow down the bots by collecting logs and
>> grabbing all the obvious patterns and blocking netblocks for non-US
>> ranges.  ipinfo.io offers a free country & ASN database download that
>> I've been using to look up the ranges and countries
>> (https://ipinfo.io/products/free-ip-database).  I would be happy to share
>> a link to our current blocklist, which has 10K non-US ranges.
>> >>>
>> >>> I've also been reporting the non-US bot activity to
>> https://www.abuseipdb.com/ just to bring some visibility to these bad
>> bots.  I noticed initially that many of the IPs that we were getting hit
>> from didn't seem to be listed on any blocklists already, so I figured some
>> reporting might help.  I'm kind of curious if Evergreen sites are getting
>> hit from the same IPs, so an Evergreen-specific blocklist would be useful.
>> If you look up your bot IPs on abuseipdb.com you can see if I've already
>> reported any of them.
>> >>>
>> >>> I've also been making use of block lists from
>> https://iplists.firehol.org/
>> >>> Such as
>> >>> https://iplists.firehol.org/files/cleantalk_30d.ipset
>> >>> https://iplists.firehol.org/files/botscout_7d.ipset
>> >>> https://iplists.firehol.org/files/firehol_abusers_1d.netset
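>> >>>
>> >>> For anyone fronting the catalog with Nginx, those files are just plain
>> >>> lists of addresses/CIDRs (comment lines start with #), so one
>> >>> quick-and-dirty option (just a sketch, with made-up paths) is to turn a
>> >>> list into deny directives and include it:
>> >>>
>> >>> # e.g.: grep -Ev '^(#|$)' firehol_abusers_1d.netset \
>> >>> #          | sed 's/.*/deny &;/' > /etc/nginx/denylists/abusers.conf
>> >>> # then, in the http or server block:
>> >>> include /etc/nginx/denylists/abusers.conf;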
>> >>>
>> >>> We are using HAProxy so I did some looking into the CrowdSec HAProxy
>> Bouncer (https://docs.crowdsec.net/u/bouncers/haproxy/) but I'm not sure
>> that would help since these IPs don't seem to be on blocklists.  But I may
>> just not quite understand how CrowdSec is supposed to work.
>> >>>
>> >>> HAProxy Enterprise has a ReCaptcha module that I think would allow us
>> to feed any non-US connections that haven't connected before through a
>> recaptcha, but the price for HAProxy Enterprise is out of our budget.
>> https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
>> >>>
>> >>> There is also a fairly up-to-date project for adding CAPTCHAs through
>> HAProxy at https://github.com/ndbiaw/haproxy-protection.  This looks
>> promising as a transparent method: it requires new connections to perform
>> a JavaScript proof-of-work calculation before allowing access.  Could be
>> a good way of handling it.
>> >>>
>> >>> We were taken out by ChatGPT bots back in December; their netblocks
>> were a bit easier to block since they were not as spread out.  I
>> recently saw this article about how some people are fighting back against
>> bots that ignore robots.txt,
>> https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
>> >>>
>> >>> Josh
>> >>>
>> >>> On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via Evergreen-dev <
>> evergreen-dev@list.evergreen-ils.org> wrote:
>> >>>>
>> >>>> Hi folks,
>> >>>>
>> >>>> Our Evergreen environment has been experiencing a higher-than-usual
>> volume of unwanted bot traffic in recent months. Much of this traffic looks
>> like webcrawlers hitting Evergreen-specific URLs from an enormous number of
>> different IP addresses. Judging from discussion in IRC last week, it sounds
>> like other EG admins have been seeing the same thing. Does anyone have any
>> recommendations for managing this traffic and mitigating its impact?
>> >>>>
>> >>>> Some solutions that have been suggested/implemented so far:
>> >>>> - Geoblocking entire countries.
>> >>>> - Using Cloudflare's proxy service. There's some trickiness in
>> getting this to work with Evergreen.
>> >>>> - Putting certain OPAC pages behind a captcha.
>> >>>> - Deploying publicly-available blocklists of "bad bot"
>> IPs/useragents/etc. (good but limited, and not EG-specific).
>> >>>> - Teaching EG to identify and deal with bot traffic itself (but
>> arguably this should happen before the traffic hits Evergreen).
>> >>>>
>> >>>> My organization is currently evaluating CrowdSec as another possible
>> solution. Any opinions on any of these approaches?
>> >>>> --
>> >>>> Jeff Davis
>> >>>> BC Libraries Cooperative
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > John Merriam
>> > Director of Information Technology
>> > Bibliomation, Inc.
>> > 24 Wooster Ave.
>> > Waterbury, CT 06708
>> > 203-577-4070
>> >
>>
>
_______________________________________________
Evergreen-dev mailing list -- evergreen-dev@list.evergreen-ils.org
To unsubscribe send an email to evergreen-dev-le...@list.evergreen-ils.org
