On Thu, Feb 12, 2026 at 11:43 PM Vlad Zahorodnii <[email protected]>
wrote:

> Hello,
>

Hi all,


>
> CI congestion is a pretty painful problem at the moment. In the event of
> a version bump or a release, a lot of CI jobs can be created, which
> slows down CI significantly. Version bumps in Plasma, Gear, and so on
> can be felt everywhere. For example, if a merge request needs to run CI
> to get merged, it can take hours before it's the merge request's turn to run
> its jobs.
>
> For the past 3 days, things have been really bad. A merge request could
> get stuck waiting for CI for 5-10 hours, some even timed out.
>
> The current CI experience is quite painful during such rush hours. It
> would be great if we could work something out. Maybe we could dynamically
> allocate additional CI runners when we know that CI is about to get
> really, really busy? Or perhaps implement some CI sharding scheme to
> contain heavy CI workloads like version bumps or mass rebuilds so other
> projects don't experience CI starvation?
>

There are a couple of things driving the experience you've seen over the
past week.

The main driver of this has been instability of the physical hardware
nodes. At one point earlier this week, only one of the four nodes was left
functional and processing jobs, with the other three having fallen over.
This has since been corrected; however, the underlying instability has been
an issue in the past, and it appears we're currently experiencing a period
of greater instability than normal.

We currently use 4 Hetzner AX52 servers for our CI nodes, and in the past
Hetzner have preemptively replaced 2 of the machines' motherboards due to
known stability issues (which didn't affect us at the time).
In the last couple of months they have withdrawn the model from sale
completely, so I suspect they are having issues again.

This hasn't been helped by the weekly rebuilds of Gear, in addition to
the release activities taking place (the first half of any given month
tends to be more resource-demanding due to the nature of these schedules).

Additionally, due to a number of base image rebuilds (Qt version updates
among other things), we've required a greater number of rebuilds
(including runs of seed jobs) than normal.

With regard to your proposed fixes, I'm afraid dynamically allocating
additional runners is not really possible with the current setup, as our
VM-based setup relies on running on bare metal. Completing jobs promptly
also relies on approximately half a terabyte of storage on each physical
node to hold base VM images, dependency archives and other caches. This
is something that newly provisioned temporary nodes simply wouldn't have.

As for whether version bumps require CI to run: they can be pushed with
ci.skip and followed by running a seed job, which occupies just one build
slot and skips tests.
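
Roughly, the push side of that looks like the sketch below (a minimal
illustration only: -o ci.skip is the standard GitLab push option, while
the branch name and the way the seed job is triggered afterwards depend
on the project and our CI tooling, so treat those parts as placeholders).

    import subprocess

    # Push the version bump without creating the usual CI pipeline.
    # "-o ci.skip" is GitLab's push option for skipping pipeline creation.
    subprocess.run(
        ["git", "push", "-o", "ci.skip", "origin", "master"],
        check=True,
    )

    # The seed job is then run on its own (one build slot, no tests).
    # How it gets kicked off depends on the CI configuration; creating a
    # pipeline for the ref by hand via the GitLab API is one option:
    #   POST /api/v4/projects/:id/pipeline?ref=master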

Resource utilisation wise, I've not looked into whether there has been a
significant bump in the number of jobs, but some additional CD support
has been added over the past year, so that points to some additional load
there. System utilisation stats since the start of February are rather
telling:

          full_path           |    time_used     | job_count
------------------------------+------------------+-----------
graphics/krita               | 228:03:25.300251 |       451
plasma/kwin                  | 164:34:16.119352 |      2032
plasma/plasma-workspace      | 136:03:27.534157 |      1044
multimedia/kdenlive          | 113:13:39.087112 |       566
network/ruqola               | 89:02:24.704107  |       369
pim/messagelib               | 83:58:38.936333  |       648
graphics/drawy               | 77:38:55.987716  |      2101
network/neochat              | 76:37:12.290371  |      1069
plasma/plasma-desktop        | 62:27:04.468738  |       619
utilities/kate               | 59:05:46.517227  |       438
office/kmymoney              | 56:58:08.349153  |       241
education/labplot            | 55:19:06.879103  |       423
frameworks/kio               | 43:04:55.33379   |       744
kde-linux/kde-linux-packages | 39:40:15.332536  |        63
libraries/ktextaddons        | 31:46:16.168865  |       356
office/calligra              | 31:33:35.74806   |        62
kdevelop/kdevelop            | 30:57:38.655126  |       113
education/kstars             | 30:04:42.328143  |        71
sysadmin/craft-ci            | 28:36:13.927477  |        53
bjordan/kdenlive             | 24:13:20.564568  |       191


Contrast that with the full month of September 2025:

           full_path           |    time_used     | job_count
-------------------------------+------------------+-----------
graphics/krita                | 382:38:14.237481 |       882
plasma/kwin                   | 365:27:01.96681  |      3531
multimedia/kdenlive           | 294:59:01.874773 |      1299
network/neochat               | 233:26:31.601019 |      2298
packaging/flatpak-kde-runtime | 225:01:38.542109 |       269
network/ruqola                | 219:56:34.415103 |       946
plasma/plasma-workspace       | 194:15:09.699541 |      2014
sysadmin/ci-management        | 118:10:28.754041 |       233
libraries/ktextaddons         | 105:25:58.724964 |      1084
office/kmymoney               | 101:50:27.612055 |       572
plasma/plasma-desktop         | 101:25:05.185953 |      1148
kde-linux/kde-linux-packages  | 101:17:06.284095 |       144
education/rkward              | 84:32:07.118316  |      1033
kdevelop/kdevelop             | 84:04:58.804139  |       270
utilities/kate                | 83:49:15.561603  |       663
frameworks/kio                | 76:46:04.377597  |      1032
graphics/okular               | 74:58:19.8886    |       628
pim/akonadi                   | 62:23:56.252831  |       531
pim/itinerary                 | 60:05:52.269941  |       578
network/kaidan                | 56:37:44.446695  |      1773
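
(If anyone wants to pull similar per-project figures themselves, a rough
approximation can be had from the regular GitLab jobs API; below is a
minimal sketch, assuming invent.kde.org as the instance, with the project
path and date range as placeholders. Depending on project visibility a
PRIVATE-TOKEN header may also be needed.)

    import requests
    from datetime import datetime, timezone

    GITLAB = "https://invent.kde.org"   # assumed instance URL
    PROJECT = "plasma/kwin"             # example project path
    SINCE = datetime(2026, 2, 1, tzinfo=timezone.utc)

    def project_ci_usage(project_path):
        """Sum job durations (seconds) and count jobs created since SINCE."""
        pid = requests.utils.quote(project_path, safe="")
        total_seconds, job_count, page = 0.0, 0, 1
        while True:
            resp = requests.get(
                f"{GITLAB}/api/v4/projects/{pid}/jobs",
                params={"per_page": 100, "page": page},
                # headers={"PRIVATE-TOKEN": "..."}  # if required
            )
            resp.raise_for_status()
            jobs = resp.json()
            if not jobs:
                break
            for job in jobs:
                created = datetime.fromisoformat(
                    job["created_at"].replace("Z", "+00:00"))
                if created >= SINCE and job.get("duration"):
                    total_seconds += job["duration"]
                    job_count += 1
            # Jobs come back newest first, so stop once a whole page is
            # older than the window we care about.
            oldest = datetime.fromisoformat(
                jobs[-1]["created_at"].replace("Z", "+00:00"))
            if oldest < SINCE:
                break
            page += 1
        return total_seconds, job_count

    seconds, jobs = project_ci_usage(PROJECT)
    print(f"{PROJECT}: {seconds / 3600:.1f} h across {jobs} jobs")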

I had already scheduled a complete replacement of our existing CI nodes
for this year given their instability, which will include an increase in
the number of nodes (exact specification to be determined, but it is
possible we go for something slightly slower than what we currently have
so as to have more nodes, rather than more computationally capable ones
but fewer of them - compare the EX44 and the EX63).


>
> Regards,
> Vlad
>
>
Thanks,
Ben
