Hi all,

As you may know, the lists were down.

The issue was trivial, but I didn’t notice, so I didn’t get them back up for a 
full week.

Timeline of events:
- 2023-10-22: Server reboots uncleanly (why?); after rebooting, Mailman doesn’t 
come back up due to a stale lockfile
- ??: Janet Cobb notifies me on Discord (I don’t see this message; perhaps it 
was sent to the wrong user?)
- 2023-10-26: Janet Cobb notifies me by email (but I didn’t notice it; I see it 
now though)
- 2023-10-27: Late at night I notice due to ALT messages ending up in my inbox
- 2023-10-28: I procrastinate on dealing with the issue
- 2023-10-29: Janet Cobb notifies me on Mastodon, and I fix the issue

I did take some actions to prevent this from happening again:
- Changed systemd configuration to ask mailmanctl to automatically clean up 
stale locks.
- Added a CloudWatch alarm that specifically checks whether Mailman qrunner 
processes are running.  The outage did trigger my existing alarm, which fires 
on any error in the Mailman log, but spurious errors are logged often enough 
that I’ve been too lazy to check up on it.  The new alarm is narrower but also 
less prone to false positives.
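
For anyone curious what a process check like that might look like, here is a 
rough Python sketch.  It is not the actual setup on the server.  It assumes 
qrunner processes can be recognized by "qrunner" appearing in their command 
line, and the namespace "Mailman" and metric name "QrunnerProcesses" are made 
up for illustration.  It also assumes boto3 is installed and AWS credentials 
are configured.

```python
import subprocess


def count_qrunners(ps_output: str) -> int:
    """Count lines of ps output whose command line mentions qrunner.

    Assumption: every Mailman queue runner has "qrunner" in its command line.
    """
    return sum(1 for line in ps_output.splitlines() if "qrunner" in line)


def list_commands() -> str:
    """Return the full command line of every process, one per line."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout


def publish(count: int) -> None:
    """Send the count to CloudWatch as a custom metric."""
    import boto3  # third-party; imported here so the parsing works without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace="Mailman",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "QrunnerProcesses",  # hypothetical metric name
                "Value": count,
                "Unit": "Count",
            }
        ],
    )


if __name__ == "__main__":
    publish(count_qrunners(list_commands()))
```

Run from cron or a systemd timer, this would feed a metric that an alarm could 
watch for dropping to zero.  The alarm should probably also treat missing data 
as breaching, so that it fires if the whole server is down and nothing is 
reporting.  Note that the substring match would also count the checker itself 
if its own file name contained "qrunner".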

However…

You all might want to consider the possibility of moving to groups.io.  Don’t 
get me wrong, I’m happy to continue running the lists for another 10 years and 
beyond.  But I have definitely been neglecting proper maintenance and 
monitoring, and that neglect will probably continue, leading to the possibility 
of more outages like this.

Up to you!

- omd