From the ‘why don’t you just leave it alone’ department: On 12/14, I upgraded some stuff on the server that hosts the lists, which I noted at the time caused a list outage for a few hours.
On 12/31 (yesterday) I was alerted on Discord that the web interface for the lists was broken, which among other things made it impossible for new players to subscribe. Since I'm in vacation mode, I wasn't on my computer for most of the day, so I didn't see the message for about 12 hours (until about 1/1 04:30 UTC). I investigated and discovered that at some point on 12/14 I had accidentally wiped out mm_cfg.py, the Mailman configuration file. Email functionality was unaffected because Mailman handles it using persistently-running daemons, which load the configuration file on startup, and I hadn't restarted them since before deleting the file. But since Mailman 2 is ancient, the *web* interface is based on CGI: every request spawns a new process which loads the configuration file from scratch. Thus it broke as soon as the file was deleted.

Luckily, I have daily backups of everything on the server! Unluckily, being a cheapskate, I've been storing those backups in S3 Glacier Deep Archive, which costs roughly a quarter as much as the next-cheapest S3 storage class, but makes you wait 12 hours between initiating a restore request and being able to access the data. So I went ahead and initiated a request, but in the meantime had to resort to some debugger silliness to partially reconstruct mm_cfg.py from the daemons that still had it in memory. Then I went to bed. That request has now completed, so I was able to recover the old mm_cfg.py, and all should be well.

Now, how to prevent this from happening in the future? First, to guard against this particular type of issue, where Mailman is throwing errors but I'm unaware of it, I added a CloudWatch alarm that fires when the size of the error log increases. Beyond that… on Discord, Aris and Gaelan suggested keeping a subset of backups in a storage service that doesn't have such a high retrieval time. This is probably a good idea, but harder than it sounds.
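For the curious: I haven't spelled out the "debugger silliness" above, but the general shape of this kind of recovery is to get a Python prompt inside a still-running daemon (one common trick is attaching gdb and evaluating code via `PyRun_SimpleString`) and then serializing the loaded config module's settings back out to disk. Here's a minimal sketch of just the serialization half; `dump_config` is a hypothetical helper, not anything Mailman ships, and it only recovers simple repr-able values (comments and formatting from the original file are gone for good):

```python
def dump_config(module, path):
    """Write a module's UPPERCASE settings back out as Python source.

    Mailman config settings are conventionally ALL_CAPS module attributes,
    so we dump those and skip dunders, helpers, and anything callable.
    Only values with a faithful repr() survive the round trip.
    """
    with open(path, "w") as f:
        for name in sorted(dir(module)):
            value = getattr(module, name)
            if name.isupper() and not callable(value):
                f.write("%s = %r\n" % (name, value))
```

Inside the daemon, this (or something like it) would be pointed at the already-imported `mm_cfg` module, which at that moment is the only surviving copy of the configuration.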
S3 does have built-in support for “lifecycle rules”, where objects can be transitioned between different storage classes (e.g. Standard, Glacier, Glacier Deep Archive) a specified number of days after they're created. This can be configured to affect all objects, or only non-current versions of objects. However, the latter only works if you're using S3 versioning. For now, I've been using Restic, which handles versioning itself, on top of the storage layer. And since it makes incremental backups, each backup job only uploads newly created data. Thus, lifecycle rules can't differentiate between data that only exists in old backups, on one hand, and data that's still current but hasn't been changed recently, on the other. Both would be considered old and moved to Deep Archive. But the file I needed to restore this time, the pre-deletion version of mm_cfg.py, was in the latter class.

The ideal approach would be for Restic to handle the storage class migration itself as well, but it doesn't support that, and I don't know of any Linux backup software that does. S3 versioning could theoretically work – but on one hand, Glacier has per-object charges, which suggests bundling files into large objects, while on the other hand, S3 versioning stores each version of an object separately rather than using delta diffs, so large objects which are repeatedly partially changed will incur storage costs for each version. Seems inefficient.

Another option would be to have two completely separate backups: one for “backups from every day, stored forever”, on Deep Archive, and another for “backups every so often, stored for a month”, not on Deep Archive. I might do that at some point. But the data on the server doesn't change that much (plus I just started a new backup series after switching to Restic), so at least for the time being, saving money on old data might not be worth having to pay twice for the current data.
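For concreteness, here's roughly what configuring such a lifecycle rule looks like through boto3. The rule ID and bucket name are placeholders of my own invention; the values (transition every object to Glacier 30 days after upload) match the rule I describe at the end of the post:

```python
# Sketch of an S3 lifecycle rule via boto3. The rule ID and bucket name
# are placeholders; Days/StorageClass are the values discussed in the post.
LIFECYCLE = {
    "Rules": [{
        "ID": "transition-backups-to-glacier",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},               # empty prefix = every object
        "Transitions": [{
            "Days": 30,                         # days after object creation
            "StorageClass": "GLACIER",          # or "DEEP_ARCHIVE"
        }],
    }]
}

def apply_lifecycle(bucket_name, config=LIFECYCLE):
    import boto3  # needs AWS credentials configured to actually run
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=config,
    )
```

Note that the `Days` clock starts at object creation, which is exactly the limitation above: Restic's incremental uploads mean an object's age says nothing about whether the data in it is still current.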
Also, this is the first time I've /ever/ had to restore from these backups, so I'm tempted to just live with the 12-hour wait and try not to screw up in the future.

But ultimately, there isn't actually that much data to store. The server is currently using only 59GB of space (most of which isn't even Agora-related), and my old backup series for the server, covering every day from 08/2015 to 10/2020, is only 167GB. Honestly, the reason I'm so paranoid about costs is mostly that I'm also using S3 to back up personal computers which use *much* more disk space. But there's no reason I can't treat the server differently, besides habit.

So for now, I just switched the lifecycle rule to use regular Glacier (data available in 3 to 5 hours, or faster for a price) instead of Glacier Deep Archive. I also set it to transition objects to Glacier 30 days after upload instead of 1 day, though I don't think this will make much difference. If anyone has any better suggestions, please let me know!
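To put rough numbers on that tradeoff: here's a back-of-the-envelope monthly cost comparison for the 167GB backup series, using per-GB prices that are my own assumptions (approximate us-east-1 rates; check current AWS pricing before trusting them):

```python
# Back-of-the-envelope monthly S3 storage costs for the 167GB backup series.
# Prices are assumptions (approximate us-east-1 $/GB-month); they change,
# so check current AWS pricing rather than relying on these numbers.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "GLACIER": 0.0036,        # Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb, storage_class):
    return gb * PRICE_PER_GB_MONTH[storage_class]

for cls in ("STANDARD", "GLACIER", "DEEP_ARCHIVE"):
    print("%-12s $%.2f/month" % (cls, monthly_cost(167, cls)))
```

With these assumed prices, even regular Glacier comes out well under a dollar a month for the whole old series, which is part of why squeezing the last bit of savings out of Deep Archive may not be worth the 12-hour retrieval wait.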