Package: prometheus
Version: 2.24.1+ds-1+b7
Severity: important
In the systemd unit provided by this Debian package (in
`debian/service`), we have this:
[Service]
Restart=on-failure
This makes it so that systemd will try to restart prometheus if it
exits for whatever reason other than exit code 0. This may be nice:
you may want it to retry if it crashes or something.
But I had a situation where I mistakenly pushed a broken config
(broken rules, more accurately) to prometheus. The gory details are in
this incident report:
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40939
... but the gist of it is that systemd repeatedly tried to restart the
service and failed. And retried, and retried... this would fill the
disk not only with logs of those attempts, but it would also grow the
WAL every time (which is a separate issue).
(Normally, prometheus fails to start and doesn't mess with its data if
the config file is broken, since 2.20:
https://github.com/prometheus/prometheus/pull/7399
... but it seems this to be a courtesy that is not extended to
rules. will file an upstream bug on that one.)
Anyways, point is maybe we shouldn't restart so aggressively. Maybe
`Restart=on-abnormal` or `Restart=on-abort` would be better? That way
systemd wouldn't try to restart prometheus on syntax errors, and
correctly fail instead of retrying the service forever.
For now, I added a local override (`Restart=no`) to get through my
day, but I'd be happy to have a discussion on the best way to deal
with this.
(I first considered limiting the number of retries to something more
decent than the current "infinity", but there isn't a setting directly
for that in systemd. There *are* things like
`StartLimitIntervalSec=interval` and `StartLimitBurst=burst` but those
were not triggered by my incident, because prometheus would take about
3 seconds to startup, which is above the default 5 restarts in 10
seconds default. So maybe that's another way to fix this, ie. raise
the StartLimitIntervalSec (to, say 30 seconds) or lower the
StartLimitBurst.)
Either way, I think we can expect prometheus to return proper exit
statuses and, in those case *not* restart prometheus, so I would
propose `Restart=on-abort` instead of the `on-failure`.
(Interestingly, the Restart=on-failure was introduced explicitly to
handle situations like this:
https://salsa.debian.org/go-team/packages/prometheus/-/commit/1a61bbb194
Excerpt:
> Subject: Change systemd service Restart directive from always to on-failure
>
> The always value is unusual, as it ignores successful exits. The
> prometheus daemon can also be requested to exit from its API, that
> should be honored.
... but it didn't take into account non-transient failures like
configuration errors.)
I'll probably followup with a MR on the package as well.
-- System Information:
Debian Release: 11.5
APT prefers stable-security
APT policy: (500, 'stable-security'), (500, 'stable-debug'), (500, 'stable'),
(1, 'unstable'), (1, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 5.10.0-19-amd64 (SMP w/4 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_FIRMWARE_WORKAROUND,
TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=fr_CA.UTF-8, LC_CTYPE=fr_CA.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages prometheus depends on:
ii adduser 3.118
ii fonts-glyphicons-halflings 1.009~3.4.1+dfsg-2
ii init-system-helpers 1.60
ii libc62.31-13+deb11u4
pn libjs-bootstrap4
pn libjs-eonasdan-bootstrap-datetimepicker
ii libjs-jquery 3.5.1+dfsg+~3.5.5-7
ii libjs-jquery-hotkeys 0~20130707+git2d51e3a9+dfsg-2.1
pn libjs-moment
pn libjs-moment-timezone
pn libjs-mustache
pn libjs-popper.js
pn libjs-rickshaw
Versions of packages prometheus recommends:
pn prometheus-node-exporter
prometheus suggests no packages.