Hi,

In the past months, the call-cc.org server has been, on and off, the target of attacks by disrespectful crawlers that abuse our services, making them unavailable for periods of time.
It can be very hard to protect services from those kinds of attacks, as crawlers:

* Ignore robots.txt
* Use innocuous-looking User-Agent strings (you cannot tell them apart from regular users)
* Use thousands of different IP addresses
* Perform thousands of requests in parallel

Blocking individual IP addresses is not feasible, as the massive number of parallel requests from different IP addresses exhausts the server's resources before we manage to block enough addresses to keep the system load at a bearable level. We have to resort to banning whole /24 blocks (256 addresses in one go).

To illustrate the problem with numbers: some attacks put the system load close to 100. During the last attack we ended up blocking close to six million IP addresses in one day, four million of them in the first four hours.

In our case, we host services that are really not suitable for crawlers, like gitweb (code browser) and Trac (bug tracker and also code browser), as they can be quite resource-demanding. They are "protected" by rules in robots.txt, but of course that alleged "protection" is useless if crawlers don't respect the rules. As a result, the volume of requests to gitweb and Trac exhausts the server's resources, putting it in a denial-of-service situation.

We have a couple of mitigations in place to avoid the complete disruption of services:

* The code browser of Trac has been disabled. Trac allowed users to browse the code from the svn repository, which is quite big. As an alternative to the Trac browser, users can still browse the svn repository via HTTP through https://code.call-cc.org/svn/chicken-eggs/ (use "anonymous" as the user and an empty password -- just press ENTER at the password prompt).

* Gitweb has been replaced by a little hack [0] that generates static pages out of the content of the git repositories served by call-cc.org [1]. The static representation of the content of the git repositories is much more limited than what gitweb offers, but hopefully it will be enough in our case (people can always clone the repositories, anyway). Serving static pages is much cheaper than serving dynamic content through gitweb. Additionally, the static pages contain a trap for crawlers: hidden links to a directory protected by robots.txt -- whoever visits those links gets blocked (hopefully only crawlers, as regular users should not even see the trap links). Rough sketches of both the static-page generation and the crawler trap follow the notes below.

* The code browser for the git representation of the egg caches [2] has been disabled. The problem here is the size of the repositories. Even though serving static pages is cheap in terms of CPU and memory use, the amount of data fetched by crawlers can become a problem in terms of traffic. We'll have to see how effective the crawler traps are before re-enabling this service.

[0] https://github.com/mario-goulart/git2html
[1] http://code.call-cc.org/githtml/
[2] https://wiki.call-cc.org/how-to-obtain-the-source-code-of-eggs#using-git
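To illustrate the static-page idea, here is a rough, hypothetical Python sketch -- it is not how git2html [0] actually works (git2html handles history, cross-links, and much more). It assumes a bare repository path, a branch named "master", and an output directory; it lists the files of the branch with git plumbing commands and writes one HTML page per text file:

  import html
  import pathlib
  import subprocess

  REPO = "/srv/git/some-repo.git"   # hypothetical bare repository
  BRANCH = "master"                 # assumed branch name
  OUTDIR = pathlib.Path("htmlout")  # where the static pages end up

  def git(*args):
      # Run a git command against the repository and return its output.
      return subprocess.run(["git", "-C", REPO, *args],
                            capture_output=True, text=True, check=True).stdout

  OUTDIR.mkdir(parents=True, exist_ok=True)

  # One page per file reachable from the branch tip.
  for entry in git("ls-tree", "-r", "--name-only", BRANCH).splitlines():
      try:
          content = git("show", f"{BRANCH}:{entry}")
      except UnicodeDecodeError:
          continue                  # skip binary blobs in this sketch
      page = OUTDIR / (entry + ".html")
      page.parent.mkdir(parents=True, exist_ok=True)
      page.write_text("<html><body><h1>%s</h1><pre>%s</pre></body></html>"
                      % (html.escape(entry), html.escape(content)))

The resulting tree of .html files can then be served as plain static content by the web server, which is where the savings over gitweb come from.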
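And a minimal, hypothetical sketch of the crawler trap -- again, not the actual code running on call-cc.org. It assumes a common access-log format (client IP first, request line in double quotes), a trap path of /trap/ behind the hidden links, and a pre-created hash:net ipset named crawler-block; it reads log lines from stdin and blocks the whole /24 of any client that follows a trap link:

  import ipaddress
  import re
  import subprocess
  import sys

  TRAP_PREFIX = "/trap/"         # hypothetical path used by the hidden links
  IPSET_NAME = "crawler-block"   # assumed pre-created: ipset create crawler-block hash:net

  # Rough matcher for common access-log lines: client IP first,
  # then the request line in double quotes ("GET /path HTTP/1.1").
  LOG_RE = re.compile(r'^(\S+) .*?"(?:GET|HEAD|POST) (\S+)')

  blocked = set()

  for line in sys.stdin:
      m = LOG_RE.match(line)
      if not m:
          continue
      ip, path = m.groups()
      if not path.startswith(TRAP_PREFIX):
          continue
      try:
          addr = ipaddress.ip_address(ip)
      except ValueError:
          continue
      if addr.version != 4:
          continue                 # keep the sketch IPv4-only
      # Collapse the offending address into its /24 block.
      net = ipaddress.ip_network(f"{ip}/24", strict=False)
      if net in blocked:
          continue
      blocked.add(net)
      # Hand the block over to the firewall via the ipset; any other
      # local blocking mechanism would work just as well.
      subprocess.run(["ipset", "add", IPSET_NAME, str(net), "-exist"],
                     check=False)

Something like this could be fed with "tail -F access.log", with a firewall rule that drops sources matching the set (e.g. iptables -m set --match-set crawler-block src -j DROP), and with the trap directory disallowed in robots.txt so that well-behaved clients never reach it.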
Attacks like the ones we are suffering have become common nowadays. We are not the only ones affected. The links below show other reports:

* https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
* https://news.ycombinator.com/item?id=43422413
* https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
* https://www.emacswiki.org/emacs/2025-02-19
* https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
* https://lists.yoctoproject.org/g/yocto/message/65021
* https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

Apologies for the inconvenience that the unavailability of services might have caused you.

All the best.
Mario

--
https://parenteses.org/mario