First of all, before getting into the "what we need to do for 3.0" I want to 
wax philosophical for a moment...

Shit goes wrong.  All sorts of amazing things.

* The computer can crash at literally any moment during any action and any 
codepath
* The OS can re-order writes to disk in just about any way
* fsync can lie (* we can't do anything about this one)
* disks can fill up
* a partition can have wrong permissions on it, both at startup and randomly 
while things are running
* a partition can go missing / randomly be unmounted.
* the OS can randomly return a few bytes of zeros in the middle of your mmapped 
file:
  https://lkml.org/lkml/2008/6/17/9
* a multi-disk corruption can cause a random block of rubbish to appear within 
a file

Run a big enough set of servers for long enough, and you'll see all these 
things, whether due to admin error, or hardware failure...

Our job as developers of Cyrus IMAPd is to make sure that we cope with what we 
can, don't fail catastrophically, and make recovery as good as possible.

On the flip side, we don't want the admin to have to micro-manage everything.  
As much as possible, we don't want the abstraction of a reliable mail store to 
leak:

http://www.joelonsoftware.com/articles/LeakyAbstractions.html

So what we want to do for Cyrus 3.0 falls into three main buckets:

1) make things more robust/scalable.  That's all the things above: handle 
them cleanly or provide the best possible recovery path.
2) make Cyrus easier to run/administer.  Things in this bucket include the 
authentication system, backups, moving users between servers, replication, etc.
3) new features and standards support.  Things like object storage, external 
search engines, JMAP, sieve variables/date/etc.


So if we are proposing something which takes away an existing repair mechanism 
- for example, right now you can rebuild mailboxes.db by walking the tree of 
directories - we'd better be proposing something just as recoverable, but 
better in some way as well.  For example: add the mailbox name (and past 
mailbox names...) to cyrus.header, and then store all the files with paths 
based on the UNIQUEID, which is a UUID, doesn't contain weird characters, and 
has a fixed length.  Then you don't have stupid things like mailbox names being 
constrained by the characters your filesystem supports or by its case 
significance, and you get fast renames... but you don't lose the ability to 
recover.
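
To make the UNIQUEID idea concrete, here is a minimal sketch of deriving an on-disk path from a UNIQUEID instead of from the mailbox name.  The directory layout (a "uuid" directory fanned out on the first two characters), the function name uniqueid_path and the 36-character textual UUID form are all assumptions made for illustration, not the layout Cyrus uses.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define UNIQUEID_LEN 36   /* assumption: canonical textual UUID, 8-4-4-4-12 */

/* Hypothetical layout: <root>/uuid/<c0>/<c1>/<uniqueid>.
 * Returns 0 on success, -1 if the id is malformed or buf is too small. */
static int uniqueid_path(const char *root, const char *uniqueid,
                         char *buf, size_t buflen)
{
    size_t i;
    int n;

    assert(root != NULL && uniqueid != NULL && buf != NULL);

    /* Fixed length and a tiny alphabet: nothing here depends on what
     * characters or case rules the filesystem supports. */
    if (strlen(uniqueid) != UNIQUEID_LEN) return -1;
    for (i = 0; i < UNIQUEID_LEN; i++) {
        unsigned char c = (unsigned char)uniqueid[i];
        if (!isxdigit(c) && c != '-') return -1;
    }

    n = snprintf(buf, buflen, "%s/uuid/%c/%c/%s",
                 root, uniqueid[0], uniqueid[1], uniqueid);
    if (n < 0 || (size_t)n >= buflen) return -1;
    return 0;
}
```

With a layout like this a rename never touches the path, because the path does not contain the name.  Recovery would then rely on the name stored in cyrus.header, as described above.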

Checksums.  We sanity check almost everywhere, because you can't do a full 
system scan at startup, checking the sha1 of every single file, to make sure 
there has been no corruption.

We scan files at backup time.  We scan them during replication.  We need a tool 
which scans them from a cron job, for people who want to check that... maybe 
reconstruct needs flags to say "check but don't change things", so you can run 
it from cron without being afraid that it will run when your data drive has 
been unmounted by accident and wipe out your entire cyrus.index because it 
can't find the spool files.
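
As a rough sketch of that "check but don't change things" behaviour, the code below refuses to do anything when the partition itself looks wrong, and only reports when asked to check.  The marker file name, the enum values and both function names are hypothetical; reconstruct has no such flag or marker file as far as this sketch is concerned.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical marker file, written once when the partition is set up.
 * If the drive is unmounted, the empty mount point will not contain it. */
#define PARTITION_MARKER ".cyrus-partition-marker"

enum action { ACTION_ABORT, ACTION_REPORT_ONLY, ACTION_REPAIR };

/* Returns 1 if the partition directory exists and holds the marker. */
static int spool_looks_sane(const char *partition)
{
    char marker[4096];
    struct stat sb;
    int n;

    assert(partition != NULL);
    if (stat(partition, &sb) != 0 || !S_ISDIR(sb.st_mode)) return 0;

    n = snprintf(marker, sizeof marker, "%s/%s", partition, PARTITION_MARKER);
    if (n < 0 || (size_t)n >= sizeof marker) return 0;

    return stat(marker, &sb) == 0 && S_ISREG(sb.st_mode);
}

/* What to do about a message file that the index lists but that is
 * missing from the spool. */
static enum action missing_file_action(const char *partition, int check_only)
{
    /* A broken spool explains the missing file; never "fix" the index. */
    if (!spool_looks_sane(partition)) return ACTION_ABORT;
    return check_only ? ACTION_REPORT_ONLY : ACTION_REPAIR;
}
```

The point of the design is the ordering: the partition check comes before any decision about individual files, so a missing mount stops the run instead of being read as thousands of missing messages.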

At FastMail we have a tool that can fetch a damaged file from its replica.  We 
need that in Cyrus - either the magic perl script, or better - something built 
in to a tool in C.  Ditto for many other FastMail specific external Perl 
utilities.

-----

So now we know what we're doing and why... here's my rough list of things that 
need doing:

* Mailbox transactions: avoid failures leaving mailboxes in corrupt state 
(might require 3-fsync commit, so we at least know if it's unfinished)
* UniqueId paths (described above)
* robust backup and restore tooling
* Replication based repair:
  a) replication and existing replica awareness in code
  b) replication based XFER (falls in with this)
  c) reconstruct support for checking replicas for files
  d) reconstruct sanity checking - if the spools are broken, don't keep working
* files by sha1 rather than UID in mailboxes?  Means you can't rebuild in 
exactly the same order without cyrus.index, but if you've lost cyrus.index you 
may as well just sort them by date and then give the mailbox a new UIDVALIDITY 
anyway.
* mailboxes.db new key format - better sorting
* For performance at scale: reverse ACL map.
* For real reliability - synchronous replicas (falls out of awareness above)
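
One possible reading of the "3-fsync commit" in the mailbox transactions item is: mark the file dirty and fsync, write the data and fsync, then mark it clean and fsync.  The sketch below does that on a toy file whose first byte is the flag.  The file format, the flag values and every function name are assumptions made for illustration, not the cyrus.index format.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define HDR_CLEAN 'C'
#define HDR_DIRTY 'D'

/* pwrite() until everything is written; returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) return -1;
        p += n; len -= (size_t)n; off += n;
    }
    return 0;
}

/* Write the one-byte flag at offset 0 and force it to disk. */
static int set_flag(int fd, char flag)
{
    if (write_all(fd, &flag, 1, 0) != 0) return -1;
    return fsync(fd);
}

/* Append one record.  Refuses to touch a file left dirty by an earlier
 * crash: the caller has to run recovery first. */
static int commit_append(const char *path, const void *rec, size_t len)
{
    char flag = HDR_CLEAN;
    off_t end;
    int fd;

    assert(path != NULL && rec != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;

    end = lseek(fd, 0, SEEK_END);
    if (end < 0) goto fail;
    if (end > 0 && pread(fd, &flag, 1, 0) != 1) goto fail;
    if (flag == HDR_DIRTY) goto fail;
    if (end == 0) end = 1;                 /* byte 0 is reserved for the flag */

    if (set_flag(fd, HDR_DIRTY) != 0) goto fail;                 /* fsync 1 */
    if (write_all(fd, rec, len, end) != 0 || fsync(fd) != 0)     /* fsync 2 */
        goto fail;
    if (set_flag(fd, HDR_CLEAN) != 0) goto fail;                 /* fsync 3 */
    return close(fd);

fail:
    close(fd);
    return -1;
}

/* 1 if a commit was started but never finished, 0 if clean, -1 on error. */
static int commit_unfinished(const char *path)
{
    char flag;
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    n = pread(fd, &flag, 1, 0);
    close(fd);
    if (n != 1) return -1;
    return flag == HDR_DIRTY;
}
```

This does not make the write atomic.  It only guarantees that after a crash the flag says whether the file can be trusted, which is the "at least know if it's unfinished" part.  It also still assumes that fsync doesn't lie.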

* For general speed and also safety - central cleanup daemon: use the same 
logic we use for sync_client and (at FastMail) squatter indexing.  Changes to 
a mailbox cause a log entry.  A daemon processes those logs and does cleanup 
tasks in the background.  During startup this log can be resolved - so 
half-finished renames can be found and finished or reverted - so long as we log 
intent before making changes... actually, I really like this:

lock(mailbox);
sync_log(mailbox->name);
/* do stuff */
unlock(mailbox);

rather than the current:

lock(mailbox);
/* do stuff */
sync_log(mailbox->name);
unlock(mailbox);

And then all the task things do a trylock, and if it fails, they just insert 
the record into their source log file again.  That way, they retry in a moment 
(to avoid a busywait, add a pause if you didn't process ANY changes this time 
around).  This way sync doesn't wait on tasks, yet intent gets logged early, 
before changes are made, so we can never miss something because there was a 
crash before the commit finished and the event was logged.
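
The trylock-and-requeue pass can be modelled in a few lines.  This is an in-memory sketch only: struct mailbox, struct task_log and process_log_once are hypothetical stand-ins, the "lock" is a plain flag rather than a real mailbox lock, and the "log" is an array rather than a sync log file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 64

struct mailbox {
    const char *name;
    int locked;      /* stand-in for a real mailbox lock */
    int cleaned;     /* counts how often the background task ran */
};

struct task_log {
    struct mailbox *entries[MAX_ENTRIES];
    size_t count;
};

/* Non-blocking lock attempt: returns 1 on success, 0 if already held. */
static int mailbox_trylock(struct mailbox *mb)
{
    if (mb->locked) return 0;
    mb->locked = 1;
    return 1;
}

static void mailbox_unlock(struct mailbox *mb)
{
    mb->locked = 0;
}

/* One pass over the log.  Busy mailboxes are put back in the log for the
 * next pass.  Returns the number of entries processed; a return of 0 means
 * the caller should pause before the next pass instead of spinning. */
static size_t process_log_once(struct task_log *log)
{
    struct mailbox *batch[MAX_ENTRIES];
    size_t pending = log->count, done = 0, i;

    assert(pending <= MAX_ENTRIES);
    memcpy(batch, log->entries, pending * sizeof batch[0]);
    log->count = 0;                          /* start a fresh log */

    for (i = 0; i < pending; i++) {
        struct mailbox *mb = batch[i];
        if (!mailbox_trylock(mb)) {
            log->entries[log->count++] = mb; /* busy: requeue, don't wait */
            continue;
        }
        mb->cleaned++;                       /* stand-in for the real task */
        mailbox_unlock(mb);
        done++;
    }
    return done;
}
```

A daemon loop around this would call process_log_once repeatedly and sleep whenever it returns 0, which is the pause described above.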

* External system integration points
* OS packages
* Docker images / VMs (for production use)

I'll try to get this into Phab tickets tonight - just about to leave work now.

Bron.


-- 
  Bron Gondwana
  br...@fastmail.fm
