Hihi,
Asking just in case: would something like Anubis [0] be desirable? The
GNOME forge has deployed it and, from what I've heard, they've had good
results with it.
[0]: https://github.com/TecharoHQ/anubis
Cheers!
On 3/29/25 12:22, Mario Domenech Goulart via Chicken-users wrote:
Hi,
Over the past months the call-cc.org server has been, off and on, the
target of attacks by disrespectful crawlers that abuse our services,
making them unavailable for periods of time.
It can be very hard to protect services from those kinds of attacks, as
crawlers:
* Ignore robots.txt
* Use innocuous-looking User-Agent strings (you cannot tell them apart
from regular users)
* Use thousands of different IP addresses
* Perform thousands of requests in parallel
Blocking individual IP addresses is not feasible, as the massive number
of parallel requests from different IP addresses exhausts the server's
resources before we manage to block enough addresses to keep the system
load at a bearable level. We have to resort to banning whole /24 blocks
(256 addresses in one go). To illustrate the problem with numbers: some
attacks push the system load close to 100, and during the last attack
we ended up blocking close to six million IP addresses in one day, four
million of them in the first four hours.
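To give an idea of what banning in /24 blocks involves, here is a
minimal sketch in Python that collapses a list of offending addresses
into /24 networks. The input file name and the per-block threshold are
made up for the example, and the actual blocking on call-cc.org may
well be done differently (e.g. directly at the firewall):

  #!/usr/bin/env python3
  # Collapse a list of offending IPv4 addresses into /24 networks so
  # they can be banned in one go instead of one address at a time.
  # "offenders.txt" and the threshold of 3 hits per block are
  # assumptions for this example, not the actual call-cc.org setup.
  import ipaddress
  from collections import Counter

  hits = Counter()
  with open("offenders.txt") as f:
      for line in f:
          ip = line.strip()
          if not ip:
              continue
          # strict=False turns e.g. 203.0.113.42/24 into 203.0.113.0/24
          hits[ipaddress.ip_network(ip + "/24", strict=False)] += 1

  # Print any /24 that contributed at least 3 offending addresses; the
  # output can then be fed to the firewall (ipset, nftables, ...).
  for net, count in sorted(hits.items()):
      if count >= 3:
          print(net)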
In our case, we host services that are really not suitable for
crawlers, like gitweb (code browser) and Trac (bug tracker and also
code browser), as they can be quite resource-demanding. They are
"protected" by rules in robots.txt, but of course that alleged
"protection" is useless if crawlers don't respect the rules.
As a result, the volume of requests to gitweb and Trac exhausts the
server's resources, putting it in a denial-of-service situation.
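For reference, this is the check a well-behaved crawler is supposed to
perform before fetching resource-hungry URLs like gitweb or Trac pages
-- the abusive crawlers simply skip it. A small sketch using Python's
standard robots.txt parser; the host paths and the bot name are made
up for the example and do not reflect the exact rules in our
robots.txt:

  #!/usr/bin/env python3
  # What a polite crawler does: fetch robots.txt and honour it before
  # requesting anything else.  The paths and the User-Agent below are
  # illustrative; the real rules on call-cc.org may differ.
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("https://code.call-cc.org/robots.txt")
  rp.read()

  for path in ("/cgi-bin/gitweb.cgi", "/trac/browser", "/"):
      url = "https://code.call-cc.org" + path
      verdict = "allowed" if rp.can_fetch("ExampleBot/1.0", url) else "disallowed"
      print(url, "->", verdict)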
We have a few mitigations in place to avoid the complete disruption of
services:
* The code browser of Trac has been disabled. Trac allowed users to
browse the code from the svn repository, which is quite big. As an
alternative to the Trac browser, users can still browse the svn
repository via HTTP at https://code.call-cc.org/svn/chicken-eggs/
(use "anonymous" as the username and an empty password -- just press
ENTER at the password prompt).
* Gitweb has been replaced by a little hack [0] that generates static
pages out of the content of git repositories served by call-cc.org
[1]. The static representation of the content of git repositories is
much more limited than what gitweb offers, but hopefully it'll be
enough in our case (people can always clone the repositories, anyway).
Serving static pages is much cheaper than serving dynamic content
through gitweb. Additionally, the static pages contain a trap for
crawlers: hidden links to a directory protected by robots.txt --
whoever visits one of those links will get blocked (hopefully only
crawlers, as users should not even see the trap links); a sketch of
the detection side follows after the links below.
* The code browser for the git representation of the egg caches [2] has
been disabled. The problem here is the size of the repositories. Even
though serving static pages is cheap in terms of CPU and memory use,
the amount of data fetched by crawlers can become a problem in terms
of traffic. We'll have to see how effective the crawler traps will be
before reenabling this service.
[0] https://github.com/mario-goulart/git2html
[1] http://code.call-cc.org/githtml/
[2] https://wiki.call-cc.org/how-to-obtain-the-source-code-of-eggs#using-git
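To make the crawler trap mentioned above a bit more concrete, here is
a minimal sketch of the detection side: scan the web server access log
for requests to the hidden, robots.txt-protected path and collect the
client addresses for banning. Both the log location and the "/trap/"
path are assumptions for this example; the real setup may differ:

  #!/usr/bin/env python3
  # Detection side of the crawler trap: find clients that requested
  # the hidden trap path (which robots.txt tells crawlers to stay out
  # of) and print their addresses as candidates for banning.
  # LOG and TRAP are assumptions for this example.
  import re

  LOG = "/var/log/apache2/access.log"
  TRAP = "/trap/"

  # Common/combined log format: client address is the first field, the
  # request line is the quoted "METHOD PATH PROTOCOL" part.
  line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) ([^" ]+)')

  offenders = set()
  with open(LOG) as f:
      for line in f:
          m = line_re.match(line)
          if m and m.group(2).startswith(TRAP):
              offenders.add(m.group(1))

  for ip in sorted(offenders):
      print(ip)  # e.g. collapse into /24 blocks and ban, as above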
Attacks like the ones we are suffering have become common nowadays. We
are not the only ones affected. The links below show other reports:
* https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
* https://news.ycombinator.com/item?id=43422413
* https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
* https://www.emacswiki.org/emacs/2025-02-19
* https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
* https://lists.yoctoproject.org/g/yocto/message/65021
* https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
Apologies for the inconvenience that the unavailability of services
might have caused you.
All the best.
Mario