Hi all, thanks for your time! Please let me know if this isn't the right
spot for this, or if I should file a bug.

I'm on Debian buster; package versions for the ones in question are:
systemd - 241-7~deb10u8
prometheus-node-exporter - 0.17.0+ds-3+b11
dbus - 1.12.20-0+deb10u1
pacemaker - 2.0.1-5+deb10u1

This has been a bit of a head-scratcher for me and some others for some
time now. Quick description of the environment: I have two-node clusters
running buster, supporting pacemaker with some systemd-type and other
resources, and exporting metrics via prometheus-node-exporter.
There is also a service on the "active" node of each two-node cluster
that, once a minute, logs into the other node via rsync over ssh.

Bizarrely, with enough uptime, systemd-logind, prometheus-node-exporter,
and pacemaker will all start to log errors of the form:
    "Failed to activate service 'org.freedesktop.systemd1'"
This of course causes pacemaker to snipe the "failed" node, since the
resources there can no longer pass their status checks.
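(Side note: I assume the same timeout could be reproduced by hand on an
affected node by poking the manager directly over the bus, something like:

  busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
      org.freedesktop.systemd1.Manager ListUnits

but I haven't yet had the chance to try that while a node is in the bad
state.)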

Uptime varies wildly between reproductions. So far we've observed 5 of
them, with uptimes of 45, 48, 62, 81, and 100 days respectively. The only
difference between those incidents is that the machine with 48 days of
uptime before it was hit was not running any of the pacemaker resources.
I'm going to include some of the logs from this machine in particular, as
it seems to be the most "pure" reproduction of the issue, and it is the
only one that survived pacemaker's ire.

The first occurrence of the error, from systemd-logind:
Jul 31 19:50:50 nosferatu systemd-logind[4042]: New session 125887 of user
rsync-user.
Jul 31 19:50:50 nosferatu systemd-logind[4042]: Session 125887 logged out.
Waiting for processes to exit.
Jul 31 19:50:50 nosferatu systemd-logind[4042]: Removed session 125887.
Jul 31 19:51:50 nosferatu systemd-logind[4042]: New session 125889 of user
rsync-user.
Jul 31 19:51:51 nosferatu systemd-logind[4042]: Session 125889 logged out.
Waiting for processes to exit.
Jul 31 19:51:51 nosferatu systemd-logind[4042]: Removed session 125889.
Jul 31 19:52:52 nosferatu systemd-logind[4042]: Failed to start user
service 'user@999.service', ignoring: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:53:17 nosferatu systemd-logind[4042]: Failed to start session
scope session-125891.scope: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:53:42 nosferatu systemd-logind[4042]: Failed to stop user service
'user@999.service', ignoring: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:54:07 nosferatu systemd-logind[4042]: Failed to start user
service 'user@999.service', ignoring: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:54:32 nosferatu systemd-logind[4042]: Failed to start session
scope session-125892.scope: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jul 31 19:54:57 nosferatu systemd-logind[4042]: Failed to stop user service
'user@999.service', ignoring: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)

Dbus-daemon logs a similar error:
Jun 20 12:45:01 nosferatu systemd[1]: Started D-Bus System Message Bus.
Jul 31 19:52:27 nosferatu dbus-daemon[4047]: [system] Failed to activate
service 'org.freedesktop.systemd1': timed out
(service_start_timeout=25000ms)

Prometheus-node-exporter reports the same:
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]:
time="2022-06-20T09:45:01-07:00" level=info msg=" - vmstat"
source="node_exporter.go:104"
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]:
time="2022-06-20T09:45:01-07:00" level=info msg=" - xfs"
source="node_exporter.go:104"
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]:
time="2022-06-20T09:45:01-07:00" level=info msg=" - zfs"
source="node_exporter.go:104"
Jun 20 12:45:01 nosferatu prometheus-node-exporter[4025]:
time="2022-06-20T09:45:01-07:00" level=info msg="Listening on :9100"
source="node_exporter.go:170"
Jul 31 19:52:27 nosferatu prometheus-node-exporter[4025]:
time="2022-07-31T16:52:27-07:00" level=error msg="ERROR: systemd collector
failed after 0.351864s: couldn't get units: Failed to activate service
'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)"
source="collector.go:132"

From the logs, it appears that prometheus-node-exporter is the first one
to hit the problem, and is then joined in a degraded waltz by
systemd-logind until one second after midnight, when:
Aug 01 00:00:01 nosferatu dbus-daemon[4047]: [system] Successfully
activated service 'org.freedesktop.systemd1'

After this, systemd-logind and prometheus-node-exporter stop failing with
the timeout message, but something interesting happens: systemd-logind
loses the ability to fully close newly opened sessions, and eventually the
once-a-minute login process causes the node to max out on user sessions.
You can sort of see this in the logs, since sessions are no longer removed
after the "Waiting for processes to exit" message (see also the loginctl
check sketched after the excerpt):
Aug 07 13:48:50 nosferatu systemd-logind[4042]: New session 136763 of user
rsync-user.
Aug 07 13:48:50 nosferatu systemd-logind[4042]: Session 136763 logged out.
Waiting for processes to exit.
Aug 07 13:49:50 nosferatu systemd-logind[4042]: New session 136764 of user
rsync-user.
Aug 07 13:49:50 nosferatu systemd-logind[4042]: Session 136764 logged out.
Waiting for processes to exit. <- 8192 total user sessions active here
Aug 21 21:12:51 nosferatu systemd-logind[4042]: Power key pressed. <- When
I rebooted the machine over IPMI so I could log in :(
Aug 21 21:12:51 nosferatu systemd-logind[4042]: Powering Off...
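(For the curious, the easiest way I know of to watch this session pile-up
on a live node is something along the lines of:

  loginctl list-sessions | wc -l
  loginctl show-session <SESSION_ID> -p State -p Scope

with <SESSION_ID> being whichever of the stuck sessions you want to
inspect.)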

I know from other instances that running `start-stop-daemon --stop
--quiet --pidfile <PIDOFASERVICE>` causes systemd to reload, with the same
"Successfully activated service"... message appearing in dbus, so I
imagine this resurrection was triggered by some nightly service, though I
can't tell which one, since the logs had been rotated before I started
looking at the system.
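If it happens again, I expect something like:

  systemctl list-timers --all
  journalctl -u dbus.service -u cron.service --since "23:55" --until "00:05"

would show what fired around midnight; cron.service there is just a guess
at the nightly job, it could as easily be anacron or a systemd timer.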

Of additional, though maybe spurious, interest is the fact that login
session IDs increment oddly: scanning the logs, the session number goes up
by 2 for every new connection, except on roughly every 10th connection,
where it goes up by 3.

I have checked for session leaks with loginctl and bus connection leaks
with busctl, confirmed that the OOM killer wasn't active and that CPU
utilization was within reason, and run most other common-sense checks
(roughly as sketched below). The systems are otherwise healthy, and so far
this hasn't happened on the same machine twice.
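Roughly, those checks amounted to things along the lines of:

  busctl --system list | wc -l    # count of names/connections on the system bus
  journalctl -k | grep -i oom     # any sign of the OOM killer

plus the loginctl session checks mentioned above, top, free, and so on.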

As you may imagine, I'm thoroughly flummoxed by this issue, and so far
I've been unable to reproduce it by doing things like increasing the
interval of the prometheus collector or the rsync service. Normally I'd
just update and presume that would solve the issue, but without knowing
*why* this issue is occurring, I'm hesitant to call anything a fix.

Does this ring any bells for anyone? Any known bugs in some subcomponent
that I might be hitting here? Does anyone have advice on useful debugging
steps that could get me closer to a solution? I've also got many, many
more logs available if anything would be of use.

Cheers, and thanks ever so much for your time and attention!
Christian
