From 10/17 to 10/19, the list VM's SMTP server experienced repeated, periodic segfaults. This happened even though the server software in question, Haraka, is written for Node.js (a memory-safe environment) and uses no native extensions. The segfaults would sometimes cause Mailman’s outgoing runner to get wedged (apparently it has no timeout), so new messages were added to the archives but weren’t delivered to anyone until I manually restarted Mailman, which I did a few times.
I honestly have no idea what caused the segfaults; opening the core dumps in a debugger was entirely unhelpful. I don’t even see any package updates in the days before the crashes started. However, I did learn that Node.js no longer officially supports 32-bit x86, which I had been running it on for... historical reasons. On 10/19 I switched to 64-bit Node, and also added some monitoring that would alert me ASAP if any more core dumps showed up.
For the next week, none did. …which I thought was because things were working fine. In reality, the lists were completely down from then to today, but nobody alerted me because they thought I already knew. Oops.
Haraka’s main process was running the whole time. But it would spawn a worker process; that worker would complain about failing to load /usr/lib/authbind/libauthbind.so.1, but treat this as a non-fatal error; then it would try to bind to port 25, and fail because it couldn’t do so without libauthbind. The worker process would die, and after a fraction of a second Haraka would spawn a new one. This cycle repeated 716,838 times, printing several dozen lines to a log file each time, until, on Saturday, the log file filled up the VM’s remaining disk space. The lack of disk space triggered a different monitoring alert, which finally prompted me to investigate the situation.
Once I had identified the issue, the fix was simple: switch libauthbind to 64-bit as well. Now the lists are back up.
I can’t guarantee they will stay up because I didn’t find the root cause of the segfaults; they might recur. But the monitoring should work as intended now. In case that doesn’t work, though, please let me know on Discord or IRC if the lists go down. If the segfaults do recur, I will spend more time in the debugger and hopefully get to the bottom of the situation.
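As an illustration of a check that would have caught this outage: the core-dump alert stayed silent because the workers died without dumping core, whereas a probe that connects to port 25 and waits for the SMTP greeting fails whenever nothing is listening. Below is a minimal Python sketch of such a probe. It is not the monitoring actually deployed on the list VM; the function name smtp_greets and its default values are hypothetical, and it only assumes the standard behavior that an SMTP server opens the session with a 220 reply.

```python
import socket


def smtp_greets(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a server at host:port opens with an SMTP 220 greeting.

    Hypothetical helper for illustration; name and defaults are assumptions.
    """
    try:
        # Fails with an OSError subclass if nothing is bound to the port
        # (connection refused) or if the connection attempt times out.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # An SMTP server speaks first; read the start of its banner.
            banner = conn.recv(512)
    except OSError:
        return False
    # "220" is the SMTP "service ready" reply code.
    return banner.startswith(b"220")
```

A cron job or monitoring agent could call smtp_greets("localhost") every few minutes and raise an alert when it returns False. Unlike watching for core dumps or checking that the main process is alive, this tests the thing that actually matters: whether mail can be handed to the server.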