If you need an Aspen site to be a test site for Anubis, you can put LARL down on the list. I believe you can whitelist IPs with Anubis, so our branches and catalog stations can skip the checks.

Josh
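P.S. If the built-in allowlist doesn't pan out, another option might be to skip Anubis at the proxy layer for known branch and catalog-station ranges. This is only an untested sketch, assuming nginx sits in front and Anubis is reverse-proxying the catalog on a local port -- the CIDRs, ports, and variable names below are all placeholders:

    # Flag trusted branch / catalog-station ranges (replace with real CIDRs).
    geo $trusted_branch {
        default          0;
        192.0.2.0/24     1;   # example branch range
        198.51.100.0/24  1;   # example catalog stations
    }

    # Trusted traffic goes straight to Evergreen; everything else hits Anubis first.
    map $trusted_branch $catalog_upstream {
        0  http://127.0.0.1:8923;   # Anubis (placeholder port)
        1  http://127.0.0.1:8080;   # Apache/Evergreen directly (placeholder port)
    }

    location / {
        proxy_pass $catalog_upstream;
    }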
On Thu, Jul 10, 2025 at 11:58 AM Jason Boyer via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:

> You can probably give up trying to look for IPs that send large numbers of requests; what I'm seeing more and more are requests from these jerks or their peers: https://brightdata.com/ai/agent-browser who have "residential proxies," i.e. the browser extensions mentioned in the story Josh posted. They send literally a single HTTP request from an IP (usually on a US telecom provider's network, so you can't reasonably block it) and then the next request comes in from a different IP.
>
> The patch in the bug Mike posted helps significantly, and unless users trade a lot of direct links to search results they shouldn't even be able to detect it.
>
> I'm looking into Anubis because we can put it in front of things more easily than baking countermeasures into everything we host. Being completely self-contained (i.e. it doesn't contact a remote server unless you want to use a geo-IP / AS-number blocking service), I prefer it to Cloudflare, especially since their "good" bot blocking isn't affordable for libraries. (I think the free level basically just doesn't allow things that use a "real" bot UA to connect to your system; if you want to block anything like a residential proxy you have to pay.)
>
> Some thoughts on UA blocking since it's come up a little: don't forget you can do things like block anything claiming to be Chrome < 100 on Windows or macOS, and set a different cutoff for Linux. Chrome will go so far on Windows and Macs as to tell you "ok look, it's been too long, I'm restarting and then we'll go to whatever page," so very old versions on those OSes are extremely unlikely. Linux can be a concern though, in case you have libraries that have very old OPACs or similar. Also be sure to block things like Windows 95 / 98 (but again, maybe some libraries have Win 7 OPACs :( ), old versions of Firefox, and anything claiming to be IE. Things actually that old likely can't even complete an SSL handshake anymore after some of the root certs have been rotated. A lot of proxies are using randomly-constructed UAs to make it harder to bulk-block them.
>
> Jason
>
> --
> Jason Boyer
> Senior System Administrator
> Equinox Open Library Initiative
> jbo...@equinoxoli.org
> +1 (877) Open-ILS (673-6457)
> https://equinoxOLI.org/
>
>
> On Thu, Jul 10, 2025 at 12:08 PM Mike Rylander via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>
>> Some things to consider, inline below...
>>
>> On Thu, Jul 10, 2025 at 11:25 AM John Merriam via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>> >
>> > Hello.
>> >
>> > This will block Chrome older than 110 (over 2 years old) in Nginx:
>> >
>> >     if ($http_user_agent ~* "(Chrome/10[0-9]\.|Chrome/[0-9][0-9]\.|Chrome/[0-9]\.)") {
>> >         return 403;
>> >     }
>> >
>> > which put a stop to it for now for us.
>>
>> Please be careful. In addition to patrons with old browsers (there are plenty out there, unfortunately), there are some black-box kiosks out in the wild, used for selfcheck and in-building OPAC machines, which run an older Chrome (and are not free to upgrade).
>>
>> > Changing user agents is trivial though, so finding other blockable patterns, such as in URLs, would be good. I didn't find a good pattern to the URLs yet, but I was only able to look at that quickly. I plan on circling back around to that at some point.
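>> > In the meantime, a couple of companion rules along the same lines might be worth testing. This is just an untested sketch -- the cutoffs are guesses and would need tuning for real patrons and kiosks:
>> >
>> >     # Anything still claiming to be IE or Windows 95/98 is almost certainly fake.
>> >     if ($http_user_agent ~* "(MSIE |Trident/|Windows 95|Windows 98)") {
>> >         return 403;
>> >     }
>> >
>> >     # Very old Firefox (below 60 here; adjust to taste).
>> >     if ($http_user_agent ~* "Firefox/([1-9]|[1-5][0-9])\.") {
>> >         return 403;
>> >     }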
>> > I don't think blocking by IP will work against what seems to be a distributed AI botnet. A few months ago we had our data center partners block all non-US IPs. That worked for a few months, but even that doesn't work anymore. We see AI bot traffic coming from US residential IP ranges. A gigantic question I have is: how are they appearing to come from residential IPs, and how could that be stopped?
>> >
>> > We plan to profile Evergreen looking for slow code that could maybe be improved, but that will be a big project.
>>
>> I invite more eyes, of course, but "big project" is a bit of an understatement. ;)
>>
>> Please be careful when testing something that seems "slow" in isolation -- making code X 10% faster will often make seemingly-unrelated code Y 90% slower.
>>
>> > We also plan to hook a WAF with machine learning into Nginx and see what that can do. Another big project.
>> >
>> > We may also put captcha on more parts of the OPAC. We have someone working on that.
>>
>> Have you looked at https://bugs.launchpad.net/evergreen/+bug/2113979? With some refinement of the URL space where the not-a-bot cookie is required, this is shaping up to be a good first-order bot killer.
>>
>> > I can allocate more resources to the OPAC, but that seems like letting them win, and they will probably eventually exhaust that as well.
>> >
>> > Anubis is a nuclear option I would like to avoid.
>>
>> I'm curious why you see this as a nuclear option. Granted, most AI scrapers right now (at least, AFAICT) seem to be essentially stateless, so it may be overkill compared to the LP bug linked above, but it's fairly straightforward to set up and maintain. The only drawback right now is that you have to use just one instance, which could become a bottleneck in a very "wide" EG setup.
>>
>> > Also don't want to turn to something like Cloudflare.
>>
>> It's certainly not cost effective for the Library space...
>>
>> > Please do share any findings and I will as well.
>> >
>> > Thanks
>> >
>> >
>> > On 7/10/2025 10:53 AM, Josh Stompro via Evergreen-dev wrote:
>> >
>> > One piece of this puzzle that I would like to understand better is how the bad actors are targeting our sites with thousands to hundreds of thousands of unique IP endpoints each day. And I just saw this article come out about how 1 million browsers have installed extensions that turn users' browsers into scrapers.
>> >
>> > https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
>> >
>> > Josh
>> >
>> >
>> > On Thu, Feb 13, 2025 at 3:49 PM Shula Link via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>> >>
>> >> It's not just Evergreen sites. I had to block all traffic from Hong Kong to our system website after we had a greater than 10x increase in visitors overnight. I tried doing it by IP, but they just changed, so it ended up being easier to just block everything.
>> >>
>> >> Shula Link (she/her)
>> >> Systems Services Librarian
>> >> Greater Clarks Hill Regional Library
>> >> sl...@columbiacountyga.gov | sl...@gchrl.org
>> >> 706-447-6702
>> >>
>> >>
>> >> On Thu, Feb 13, 2025 at 4:46 PM Blake Graham-Henderson via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>> >>>
>> >>> All,
>> >>>
>> >>> I almost replied with the arstechnica article that Josh linked when the thread was started.
>> >>> But I decided not to put it out there until I had set up a test system to see if I could get that code working. A tarpit, I think, serves them right. And, of course, the whole issue is destined to receive the fate of spam and spam filters, forever and ever.
>> >>>
>> >>> It was a serendipitously timed article. Its existence at this moment in time signals to me that this isn't a "just us" problem. It's the entire planet.
>> >>>
>> >>> -Blake-
>> >>> Conducting Magic
>> >>> Will consume any data format
>> >>> MOBIUS
>> >>>
>> >>> On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
>> >>>
>> >>> Jeff, thanks for bringing this up on the list.
>> >>>
>> >>> We are seeing a lot of requests like "GET /eg/opac/mylist/delete?anchor=record_184821&record=184821" from never-before-seen IPs, and they make 1-12 requests and then stop.
>> >>>
>> >>> And they usually seem to have a random out-of-date Chrome version in the user agent string:
>> >>> Chrome/88.0.4324.192
>> >>> Chrome/86.0.4240.75
>> >>>
>> >>> I've been trying to slow down the bots by collecting logs, grabbing all the obvious patterns, and blocking netblocks for non-US ranges. ipinfo.io offers a free country & ASN database download that I've been using to look up the ranges and countries (https://ipinfo.io/products/free-ip-database). I would be happy to share a link to our current blocklist, which has 10K non-US ranges.
>> >>>
>> >>> I've also been reporting the non-US bot activity to https://www.abuseipdb.com/ just to bring some visibility to these bad bots. I noticed initially that many of the IPs we were getting hit from didn't seem to be listed on any blocklists already, so I figured some reporting might help. I'm kind of curious whether other Evergreen sites are getting hit from the same IPs, so an Evergreen-specific blocklist would be useful. If you look up your bot IPs on abuseipdb.com you can see if I've already reported any of them.
>> >>>
>> >>> I've also been making use of block lists from https://iplists.firehol.org/, such as:
>> >>> https://iplists.firehol.org/files/cleantalk_30d.ipset
>> >>> https://iplists.firehol.org/files/botscout_7d.ipset
>> >>> https://iplists.firehol.org/files/firehol_abusers_1d.netset
>> >>>
>> >>> We are using HAProxy, so I did some looking into the CrowdSec HAProxy Bouncer (https://docs.crowdsec.net/u/bouncers/haproxy/), but I'm not sure that would help since these IPs don't seem to be on blocklists. But I may just not quite understand how CrowdSec is supposed to work.
>> >>>
>> >>> HAProxy Enterprise has a ReCaptcha module that I think would allow us to feed any non-US connections that haven't connected before through a recaptcha, but the price for HAProxy Enterprise is out of our budget. https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
>> >>>
>> >>> There is also a fairly up-to-date project for adding captchas through HAProxy at https://github.com/ndbiaw/haproxy-protection. This looks promising as a transparent method: it requires new connections to perform a JavaScript proof-of-work calculation before allowing access. Could be a good transparent way of handling it.
>> >>>
>> >>> We were taken out by ChatGPT bots back in December; those were a bit easier to block by netblock since they were not as spread out.
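>> >>> For anyone fronting the OPAC with nginx instead of HAProxy, the same kind of netblock list can be fed to nginx's geo module. Rough, untested sketch only -- the file path, variable name, and list format are placeholders:
>> >>>
>> >>>     # blocklist.conf holds one entry per line, e.g. "203.0.113.0/24 1;"
>> >>>     geo $blocked_netblock {
>> >>>         default 0;
>> >>>         include /etc/nginx/conf.d/blocklist.conf;
>> >>>     }
>> >>>
>> >>>     # then, inside the server or location block serving the OPAC:
>> >>>     if ($blocked_netblock) {
>> >>>         return 403;
>> >>>     }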
>> >>> I recently saw this article about how some people are fighting back against bots that ignore robots.txt:
>> >>> https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
>> >>>
>> >>> Josh
>> >>>
>> >>> On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>> >>>>
>> >>>> Hi folks,
>> >>>>
>> >>>> Our Evergreen environment has been experiencing a higher-than-usual volume of unwanted bot traffic in recent months. Much of this traffic looks like webcrawlers hitting Evergreen-specific URLs from an enormous number of different IP addresses. Judging from discussion in IRC last week, it sounds like other EG admins have been seeing the same thing. Does anyone have any recommendations for managing this traffic and mitigating its impact?
>> >>>>
>> >>>> Some solutions that have been suggested/implemented so far:
>> >>>> - Geoblocking entire countries.
>> >>>> - Using Cloudflare's proxy service. There's some trickiness in getting this to work with Evergreen.
>> >>>> - Putting certain OPAC pages behind a captcha.
>> >>>> - Deploying publicly-available blocklists of "bad bot" IPs/useragents/etc. (good but limited, and not EG-specific).
>> >>>> - Teaching EG to identify and deal with bot traffic itself (but arguably this should happen before the traffic hits Evergreen).
>> >>>>
>> >>>> My organization is currently evaluating CrowdSec as another possible solution. Any opinions on any of these approaches?
>> >>>> --
>> >>>> Jeff Davis
>> >>>> BC Libraries Cooperative
>> >
>> > --
>> > John Merriam
>> > Director of Information Technology
>> > Bibliomation, Inc.
>> > 24 Wooster Ave.
>> > Waterbury, CT 06708
>> > 203-577-4070
_______________________________________________
Evergreen-dev mailing list -- evergreen-dev@list.evergreen-ils.org
To unsubscribe send an email to evergreen-dev-le...@list.evergreen-ils.org