From the ‘why don’t you just leave it alone’ department:

On 12/14, I upgraded some stuff on the server that hosts the lists, which, as I 
noted at the time, caused a list outage for a few hours.

On 12/31 (yesterday) I was alerted on Discord that the web interface for the 
lists was broken, which among other things made it impossible for new players 
to subscribe.  Since I’m in vacation mode, I wasn’t on my computer for most of 
the day, so I didn’t get the message for about 12 hours (until about 1/1 04:30 
UTC).

I investigated and discovered that at some point on 12/14 I had accidentally 
wiped out mm_cfg.py, the Mailman configuration file.  Email functionality was 
unaffected because Mailman handles it using persistently-running daemons, which 
load the configuration file on startup, and I hadn’t restarted them since 
before deleting the file.  But since Mailman 2 is ancient, the *web* interface 
is based on CGI: every request spawns a new process which loads the 
configuration file from scratch.  Thus it broke as soon as the file was deleted.

Luckily, I have daily backups of everything on the server!  Unluckily, being a 
cheapskate, I’ve been storing those backups in S3 Glacier Deep Archive, which 
is a quarter of the price of other S3 storage options, but makes you wait 12 
hours between initiating a restore request and being able to access the data.  
So I went ahead and initiated a request, but in the meantime had to resort to 
some debugger silliness to partially reconstruct mm_cfg.py from the copy still 
loaded in the running daemons’ memory.  Then I went to bed.
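
Initiating the restore is a per-object API call.  A sketch, with made-up 
bucket and key names (and since Restic stores everything as pack files, in 
practice this means staging whichever pack objects the restore will need):

    import boto3

    s3 = boto3.client('s3')
    # Ask S3 to stage an archived object for retrieval.  For Deep Archive,
    # the 'Standard' tier takes up to 12 hours; 'Bulk' is cheaper but can
    # take up to 48.
    s3.restore_object(
        Bucket='agora-server-backups',      # hypothetical bucket
        Key='restic/data/<pack-file>',      # hypothetical key
        RestoreRequest={
            'Days': 7,  # how long to keep the thawed copy available
            'GlacierJobParameters': {'Tier': 'Standard'},
        },
    )

As for the debugger silliness: one way to pull config out of a still-running 
process (a sketch, not necessarily the exact incantation) is to attach gdb to 
a qrunner and have it execute a bit of Python, via PyRun_SimpleString, that 
dumps the mm_cfg module already loaded in memory:

    # Runs *inside* the qrunner process; the output path is illustrative.
    # Some values won't round-trip (functions, modules) and the dump also
    # picks up the defaults, hence "partially" reconstruct.
    from Mailman import mm_cfg

    with open('/tmp/mm_cfg_recovered.py', 'w') as f:
        for name in sorted(dir(mm_cfg)):
            if name.startswith('_'):
                continue
            f.write('%s = %r\n' % (name, getattr(mm_cfg, name)))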

That request has now completed, so I was able to recover the old mm_cfg.py and 
all should be well.

Now, how to prevent this from happening in the future?

First, to guard against this particular type of issue, where Mailman is 
throwing errors but I’m unaware of it, I added a CloudWatch alarm that fires 
whenever the size of the error log increases.
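
One way to wire that up (a sketch; the log path, metric, and alarm names are 
all made up): publish the log’s size as a custom metric from cron, then alarm 
on the change between datapoints using metric math:

    import os
    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # Run periodically (e.g. from cron): publish the error log's current size.
    cloudwatch.put_metric_data(
        Namespace='Mailman',
        MetricData=[{
            'MetricName': 'ErrorLogBytes',
            'Value': os.path.getsize('/var/log/mailman/error'),
            'Unit': 'Bytes',
        }],
    )

    # One-time setup: alarm whenever the size grows between datapoints
    # (DIFF() is CloudWatch metric math: current value minus previous value).
    SNS_TOPIC_ARN = 'arn:aws:sns:...'  # SNS topic to notify (fill in)
    cloudwatch.put_metric_alarm(
        AlarmName='mailman-error-log-growing',
        Metrics=[
            {'Id': 'size',
             'MetricStat': {'Metric': {'Namespace': 'Mailman',
                                       'MetricName': 'ErrorLogBytes'},
                            'Period': 3600, 'Stat': 'Maximum'},
             'ReturnData': False},
            {'Id': 'growth', 'Expression': 'DIFF(size)', 'ReturnData': True},
        ],
        ComparisonOperator='GreaterThanThreshold',
        Threshold=0,
        EvaluationPeriods=1,
        AlarmActions=[SNS_TOPIC_ARN],
    )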

Beyond that…

On Discord, Aris and Gaelan suggested keeping a subset of backups in a storage 
service that doesn’t have such a high retrieval time.  This is probably a good 
idea, but harder than it sounds.  S3 does have built-in support for “lifecycle 
rules” where objects can be transitioned between different storage classes 
(e.g. normal, Glacier, Glacier Deep Archive) a specified number of days after 
they’re created.  This can be configured to affect all objects, or only 
non-current versions of objects.  However, the latter only works if you’re 
using S3 versioning.
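
For reference, setting up such a rule is a single API call.  A sketch with a 
made-up bucket name, showing both variants:

    import boto3

    s3 = boto3.client('s3')
    s3.put_bucket_lifecycle_configuration(
        Bucket='agora-server-backups',      # hypothetical bucket
        LifecycleConfiguration={'Rules': [{
            'ID': 'archive-old-backup-data',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},       # match every object
            # Variant 1: applies to all objects, by age since creation.
            'Transitions': [
                {'Days': 30, 'StorageClass': 'DEEP_ARCHIVE'},
            ],
            # Variant 2: only non-current versions (requires S3 versioning):
            # 'NoncurrentVersionTransitions': [
            #     {'NoncurrentDays': 30, 'StorageClass': 'DEEP_ARCHIVE'},
            # ],
        }]},
    )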

For now, I’ve been using Restic, which handles versioning itself, on top of the 
storage layer.  And since it makes incremental backups, each backup job only 
uploads newly created data.  Thus, lifecycle rules can’t differentiate between 
data that only exists in old backups, on one hand, and data that’s still 
current but hasn’t been changed recently, on the other.  Both would be 
considered old and moved to Deep Archive.  But the file I needed to restore 
this time, the pre-deletion version of mm_cfg.py, was in the latter class.

The ideal approach would be for Restic to handle the storage class migration 
itself as well, but it doesn’t support that, and I don’t know of any Linux 
backup software that does.  S3 versioning could theoretically work – but on 
one hand, Glacier has per-object overhead and per-request charges, which 
suggests bundling files into large objects, while on the other hand, S3 
versioning stores each version of an object in full rather than as a delta, so 
large objects that are repeatedly partially changed incur storage costs for 
every full copy.  Seems inefficient.
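
To put numbers on that (purely made-up numbers, just to illustrate the shape 
of the problem):

    # Illustrative arithmetic only; the sizes and change rate are invented.
    object_size_gb = 1.0          # one bundled backup object
    daily_change_fraction = 0.01  # 1% of its contents change per day
    days = 30

    # S3 versioning keeps a full copy of the object for every rewrite...
    versioned_gb = object_size_gb * (days + 1)                   # 31 GB

    # ...while an incremental scheme like Restic's stores only the changes.
    incremental_gb = object_size_gb * (1 + daily_change_fraction * days)  # 1.3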

Another option would be to have two completely separate backups: one for 
“backups from every day, stored forever”, on Deep Archive, and another for 
“backups every so often, stored for a month”, not on Deep Archive.  I might do 
that at some point.  But the data on the server doesn’t change that much (plus 
I just started a new backup series after switching to Restic), so at least for 
the time being, saving money on old data might not be worth having to pay twice 
for the current data.

Also, this is the first time I’ve /ever/ had to restore from these backups, so 
I am tempted to just live with the 12-hour wait and try not to screw up in the 
future.

But ultimately, there isn’t actually that much data to store.  The server is 
currently using only 59GB of space (most of which isn’t even Agora-related).  
My old backup series for the server covering every day from 08/2015 to 10/2020 
is only 167GB.  Honestly, the reason I’m so paranoid about costs is mostly that 
I’m also using S3 to back up personal computers which use *much* more disk 
space.  But there’s no reason I can’t treat the server differently, besides 
habit.
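
Back-of-the-envelope, using roughly the published per-GB-month rates (these 
are assumptions from memory – check the actual pricing page):

    # Rough monthly storage cost for ~226GB (59GB current + 167GB old series).
    total_gb = 59 + 167

    price_per_gb_month = {          # approximate us-east-1 rates (assumed)
        'S3 Standard':          0.023,
        'S3 Glacier':           0.004,
        'Glacier Deep Archive': 0.00099,
    }

    for tier, price in price_per_gb_month.items():
        print('%-22s ~$%.2f/month' % (tier, total_gb * price))
    # Roughly $5.20, $0.90, and $0.22 per month, respectively.

If those rates are in the right ballpark, even keeping everything in plain S3 
Standard would only be on the order of five bucks a month.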

So for now, I just switched the lifecycle rule to use regular Glacier (data 
available in 3 to 5 hours, or faster for a price) instead of Glacier Deep 
Archive.  I also set it to transition objects to Glacier 30 days after they’re 
uploaded instead of 1 day, though I don’t think this will make much difference.
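
The rule now effectively says this (names still made up, plugged into the same 
put_bucket_lifecycle_configuration call sketched earlier; the old version said 
1 day and DEEP_ARCHIVE):

    rule = {
        'ID': 'archive-old-backup-data',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},
        'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
    }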

If anyone has any better suggestions, please let me know!

