DIS: [Distributor] List postmortem

omd via agora-discussion Sun, 25 Oct 2020 21:04:07 -0700

From 10/17 to 10/19, the list VM's SMTP server experienced repeated, periodic 
segfaults.  This is despite the fact that the server software in question, 
Haraka, is written for Node.js (a memory-safe environment) with no native 
extensions.  The segfaults would sometimes cause Mailman’s outgoing runner to 
get wedged (apparently it has no timeout), so new messages were added to the 
archives but weren’t delivered to anyone until I manually restarted Mailman, 
which I did a few times.


I honestly have no idea what caused the segfaults; opening the core dumps in a 
debugger was entirely unhelpful.  I don’t even see any package updates in the 
days before the crashes started.  However, I did learn that Node.js no longer 
officially supports 32-bit x86, which I had been running it on for... 
historical reasons.  On 10/19 I switched to 64-bit Node, and also added some 
monitoring that would alert me ASAP if any more core dumps showed up.  For the 
next week, none did.

…which I thought was because things were working fine.  In reality, the lists 
were completely down from then to today, but nobody alerted me because they 
thought I already knew.

Ooops.
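
The gap described above is that "no core dumps" is not the same as "mail is 
flowing".  As an illustration only (this is not the monitoring actually 
running on the list VM), here is a minimal sketch of a probe that checks the 
service itself by connecting to the SMTP port and looking for the standard 
220 greeting.  The function name, the default port, and the timeout are 
assumptions chosen for the example.

```python
import socket


def smtp_banner_ok(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Return True if a server at host:port greets us with an SMTP 220 banner.

    Hypothetical helper for illustration; a real check would probably also
    send a test message and confirm that it is delivered.
    """
    try:
        # The timeout applies to the connect and to the recv below, so a
        # wedged server cannot hang the probe forever.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512)
    except OSError:
        # Connection refused, timed out, unreachable host, and so on.
        return False
    return banner.startswith(b"220")
```

A probe like this would have failed as soon as no worker could bind port 25, 
even though the main process was alive and no core dumps were being written.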

Haraka’s main process was running the whole time.  But it would spawn a worker 
process; that worker would complain about failing to load 
/usr/lib/authbind/libauthbind.so.1, but treat this as a non-fatal error; then 
it would try to bind on port 25, and fail because it can’t do so without 
libauthbind.  The worker process would die, and after a fraction of a second 
Haraka would spawn a new one.  This cycle repeated 716,838 times, printing 
several dozen lines to a log file each time, until, on Saturday, the log file 
filled up the VM’s remaining disk space.  The lack of disk space triggered a 
different monitoring alert, which caused me to finally investigate the 
situation.
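
The spawn/die/respawn cycle above is a classic crash loop.  As a rough sketch 
of how a supervisor or monitor could notice one (not something Haraka is 
known to do; the class name and thresholds here are hypothetical), restarts 
can be counted inside a sliding time window:

```python
from collections import deque


class RestartTracker:
    """Flag a crash loop when too many restarts land in a sliding window.

    Hypothetical helper for illustration only.
    """

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of recent restarts, oldest first

    def record_restart(self, now: float) -> bool:
        """Record a restart at time `now` (seconds); True means crash loop."""
        self._times.append(now)
        # Forget restarts that have aged out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_restarts


# Example: allow at most 3 restarts per minute before raising an alert.
tracker = RestartTracker(max_restarts=3, window_seconds=60.0)
alerts = [tracker.record_restart(t) for t in (0.0, 1.0, 2.0, 3.0)]
```

With a worker dying a fraction of a second after each spawn, a check like this 
would trip within the first few seconds instead of waiting for the disk to 
fill.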

Once I determined the issue, the fix was simple: switch libauthbind to 64-bit 
as well.  Now the lists are back up.

I can’t guarantee they will stay up because I didn’t find the root cause of the 
segfaults; they might recur.  But the monitoring should work as intended now.  
In case that doesn’t work, though, please let me know on Discord or IRC if the 
lists go down.  If the segfaults do recur, I will spend more time in the 
debugger and hopefully get to the bottom of the situation.
