From 10/17 to 10/19, the list VM’s SMTP server segfaulted repeatedly. This
happened even though the server software in question, Haraka, is written for
Node.js (a memory-safe environment) and uses no native extensions. The
segfaults would sometimes cause Mailman’s outgoing runner to
get wedged (apparently it has no timeout), so new messages were added to the
archives but weren’t delivered to anyone until I manually restarted Mailman,
which I did a few times.
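
To illustrate why a missing timeout matters here: a client that connects to a dead or stalled SMTP server and then waits for a reply will block forever unless the socket has a timeout. The following is a minimal Python sketch of a delivery attempt that gives up instead of hanging. It is not Mailman's actual delivery code; the function name deliver and the 30-second default are hypothetical.

```python
import smtplib


def deliver(host: str, port: int, sender: str, recipient: str, message: str,
            timeout: float = 30.0) -> bool:
    """Try one delivery; return False instead of hanging if the server stalls."""
    try:
        # The timeout covers the connect and every later blocking socket
        # operation, including the wait for the server's greeting.
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.sendmail(sender, [recipient], message)
        return True
    except (OSError, smtplib.SMTPException):
        # Refused, timed out, disconnected, or rejected: the caller can retry later.
        return False
```

With a timeout in place, a crash on the server side turns into a failed attempt that can be retried, rather than a wedged process that needs a manual restart.
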
I honestly have no idea what caused the segfaults; opening the core dumps in a
debugger was entirely unhelpful. I don’t even see any package updates in the
days before the crashes started. However, I did learn that Node.js no longer
officially supports 32-bit x86, which I had been running it on for...
historical reasons. On 10/19 I switched to 64-bit Node, and also added some
monitoring that would alert me ASAP if any more core dumps showed up. For the
next week, none did.
…which I thought was because things were working fine. In reality, the lists
were completely down from then to today, but nobody alerted me because they
thought I already knew.
Oops.
Haraka’s main process was running the whole time. But it would spawn a worker
process; that worker would complain about failing to load
/usr/lib/authbind/libauthbind.so.1, but treat this as a non-fatal error; then
it would try to bind to port 25, and fail because it can’t do so without
libauthbind. The worker process would die, and after a fraction of a second
Haraka would spawn a new one. This cycle repeated 716,838 times, printing
several dozen lines to a log file each time, until, on Saturday, the log file
filled up the VM’s remaining disk space. The lack of disk space triggered a
different monitoring alert, which caused me to finally investigate the
situation.
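
A check for core dumps cannot see this kind of failure, because nothing crashes with a signal; the port simply never opens. A check that connects to the SMTP port and waits for the greeting would notice. Here is a minimal Python sketch of such a probe. It is an illustration only, not the monitoring actually running on the list VM; the function name smtp_is_accepting and the default timeout are hypothetical.

```python
import socket


def smtp_is_accepting(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if something on host:port answers with an SMTP 220 greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # An SMTP server speaks first; a healthy one sends "220 ...".
            greeting = conn.recv(512)
    except OSError:
        # Connection refused, unreachable, or timed out: nothing is serving mail.
        return False
    return greeting.startswith(b"220")


# Hypothetical usage from a cron job or monitoring agent:
#   if not smtp_is_accepting("127.0.0.1"): send an alert
```

In the crash loop described above, no process ever held port 25, so a probe like this would return False on its first run.
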
Once I determined the issue, the fix was simple: switch libauthbind to 64-bit
as well, so that the 64-bit Node process can load it. Now the lists are back
up.
I can’t guarantee they will stay up because I didn’t find the root cause of the
segfaults; they might recur. But the monitoring should work as intended now.
In case that doesn’t work, though, please let me know on Discord or IRC if the
lists go down. If the segfaults do recur, I will spend more time in the
debugger and hopefully get to the bottom of the situation.