Hi,

In the past months, the call-cc.org server has been, on and off, the target of attacks by disrespectful crawlers that abuse our services, making them unavailable for periods of time.
It can be very hard to protect services from those kinds of attacks, as crawlers:

* Ignore robots.txt
* Use innocuous-looking User-Agent strings (you cannot tell them apart from regular users)
* Use thousands of different IP addresses
* Perform thousands of requests in parallel

Blocking individual IP addresses is not feasible, as the massive number of parallel requests from different IP addresses exhausts the server's resources before we manage to block enough addresses to keep the system load at a bearable level. We have to resort to banning whole /24 blocks (256 addresses in one go).

To illustrate the problem with numbers: some attacks put the system load close to 100. During the last attack we ended up blocking close to six million IP addresses in one day, four million of them in the first four hours.

In our case, we host services that are really not suitable for crawlers, like gitweb (code browser) and Trac (bug tracker and also code browser), as they can be quite resource-demanding. They are "protected" by rules in robots.txt, but of course that alleged "protection" is useless if crawlers don't respect the rules. As a result, the volume of requests to gitweb and Trac exhausts the server's resources, putting it in a denial-of-service situation.

We have a couple of mitigations in place to avoid the complete disruption of services:

* The code browser of Trac has been disabled. Trac allowed users to browse the code from the svn repository, which is quite big. As an alternative to the Trac browser, users can still browse the svn repository via HTTP through https://code.call-cc.org/svn/chicken-eggs/ (use "anonymous" as the user and an empty password -- just press ENTER at the password prompt).

* Gitweb has been replaced by a little hack [0] that generates static pages out of the content of the git repositories served by call-cc.org [1]. The static representation of the content of the git repositories is much more limited than what gitweb offers, but hopefully it will be enough in our case (people can always clone the repositories, anyway). Serving static pages is much cheaper than serving dynamic content through gitweb. Additionally, the static pages contain a trap for crawlers: hidden links to a directory protected by robots.txt -- whoever visits those links gets blocked (hopefully only crawlers, as regular users should not even see the trap links). Rough sketches of both the static-page generation and the crawler trap follow the notes below.

* The code browser for the git representation of the egg caches [2] has been disabled. The problem here is the size of the repositories. Even though serving static pages is cheap in terms of CPU and memory use, the amount of data fetched by crawlers can become a problem in terms of traffic. We'll have to see how effective the crawler traps are before re-enabling this service.

[0] https://github.com/mario-goulart/git2html
[1] http://code.call-cc.org/githtml/
[2] https://wiki.call-cc.org/how-to-obtain-the-source-code-of-eggs#using-git
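To illustrate the static-page idea, here is a rough, hypothetical Python sketch -- it is not how git2html [0] actually works (git2html handles history, cross-links, and much more). It assumes a bare repository path, a branch named "master", and an output directory; it lists the files of the branch with git plumbing commands and writes one HTML page per text file:

  import html
  import pathlib
  import subprocess

  REPO = "/srv/git/some-repo.git"   # hypothetical bare repository
  BRANCH = "master"                 # assumed branch name
  OUTDIR = pathlib.Path("htmlout")  # where the static pages end up

  def git(*args):
      # Run a git command against the repository and return its output.
      return subprocess.run(["git", "-C", REPO, *args],
                            capture_output=True, text=True, check=True).stdout

  OUTDIR.mkdir(parents=True, exist_ok=True)

  # One page per file reachable from the branch tip.
  for entry in git("ls-tree", "-r", "--name-only", BRANCH).splitlines():
      try:
          content = git("show", f"{BRANCH}:{entry}")
      except UnicodeDecodeError:
          continue                  # skip binary blobs in this sketch
      page = OUTDIR / (entry + ".html")
      page.parent.mkdir(parents=True, exist_ok=True)
      page.write_text("<html><body><h1>%s</h1><pre>%s</pre></body></html>"
                      % (html.escape(entry), html.escape(content)))

The resulting tree of .html files can then be served as plain static content by the web server, which is where the savings over gitweb come from.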
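And a minimal, hypothetical sketch of the crawler trap -- again, not the actual code running on call-cc.org. It assumes a common access-log format (client IP first, request line in double quotes), a trap path of /trap/ behind the hidden links, and a pre-created hash:net ipset named crawler-block; it reads log lines from stdin and blocks the whole /24 of any client that follows a trap link:

  import ipaddress
  import re
  import subprocess
  import sys

  TRAP_PREFIX = "/trap/"         # hypothetical path used by the hidden links
  IPSET_NAME = "crawler-block"   # assumed pre-created: ipset create crawler-block hash:net

  # Rough matcher for common access-log lines: client IP first,
  # then the request line in double quotes ("GET /path HTTP/1.1").
  LOG_RE = re.compile(r'^(\S+) .*?"(?:GET|HEAD|POST) (\S+)')

  blocked = set()

  for line in sys.stdin:
      m = LOG_RE.match(line)
      if not m:
          continue
      ip, path = m.groups()
      if not path.startswith(TRAP_PREFIX):
          continue
      try:
          addr = ipaddress.ip_address(ip)
      except ValueError:
          continue
      if addr.version != 4:
          continue                 # keep the sketch IPv4-only
      # Collapse the offending address into its /24 block.
      net = ipaddress.ip_network(f"{ip}/24", strict=False)
      if net in blocked:
          continue
      blocked.add(net)
      # Hand the block over to the firewall via the ipset; any other
      # local blocking mechanism would work just as well.
      subprocess.run(["ipset", "add", IPSET_NAME, str(net), "-exist"],
                     check=False)

Something like this could be fed with "tail -F access.log", with a firewall rule that drops sources matching the set (e.g. iptables -m set --match-set crawler-block src -j DROP), and with the trap directory disallowed in robots.txt so that well-behaved clients never reach it.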
Attacks like the ones we are suffering have become common nowadays. We are not the only ones affected. The links below show other reports:

* https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
* https://news.ycombinator.com/item?id=43422413
* https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
* https://www.emacswiki.org/emacs/2025-02-19
* https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
* https://lists.yoctoproject.org/g/yocto/message/65021
* https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

Apologies for the inconvenience that the unavailability of services might have caused you.

All the best.
Mario

--
https://parenteses.org/mario