Cyrus message file race conditions and failure modes

Bron Gondwana Sun, 21 Jun 2015 00:44:42 -0700

Obviously with computers, there can be a failure at any point.  Cyrus is very 
good about not returning a success code (LMTP or IMAP) until the mailbox is 
entirely written into a consistent state... but there are potential failures 
along the way.


The cyrus.index file in particular has no multi-stage commit, so it's possible 
for it to be corrupted, but that's an incredibly small window these days, 
because we store up the entire transaction's worth of changes in memory before 
writing anything, and then sort it by UID and stream it to disk before fsyncing 
the file.  This is _commit_changes and _commit_one in imap/mailbox.c.
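
To make the "buffer, sort, stream, fsync" idea concrete, here is a minimal sketch in Python.  It is not the code from imap/mailbox.c: the record layout and the commit_changes name below are made up for illustration, and the real cyrus.index format is quite different.

```python
import os
import struct

# Hypothetical fixed-size record (uid, flags); NOT the real cyrus.index layout.
RECORD = struct.Struct(">II")


def commit_changes(path, pending):
    """Write every buffered (uid, flags) change in UID order, then fsync once.

    Nothing touches the file until the whole transaction is in memory, so the
    window in which the file is half-written is one sequential write.
    """
    ordered = sorted(pending, key=lambda rec: rec[0])
    with open(path, "ab") as f:
        for uid, flags in ordered:
            f.write(RECORD.pack(uid, flags))
        f.flush()
        os.fsync(f.fileno())  # single fsync at the end, no multi-stage commit
    return [uid for uid, _ in ordered]
```

Note that this still has the hole described above: a crash in the middle of the write loop leaves a partial file, it's just a very short loop.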

You can read a bit more about how we avoid that hole with future pointers and 
three-fsync transactions in twoskip at 
http://opera.brong.fastmail.fm.user.fm/talks/twoskip/ - the ODP file has both 
the slides and the speaker's notes.  Ideally any rewrite of cyrus.index would 
include a similar journal-based method of protecting the file from corruption.

Anyway, let's talk about message delivery.

The first part of any message being delivered to a mailbox, either via IMAP 
append or LMTP delivery, is the "stage." directory.  This is in the data 
partition (not yet ever the archive partition - but that's because I haven't 
written it - ideally it would spool the first megabyte to $spool/stage. and 
then copy it to $archive/stage. when it gets too big - or even just spool to 
memory for the first megabyte and then down to archive... hmm), but I'm 
rambling.

The point is, it's always "$spool/stage./$pid-$internaldate-$messagenum" where 
message files are written right now.  This is created by append_newstage in 
imap/append.c.  You'll note when reading it that the specific spool partition 
is chosen based on the mailbox name by mboxlist_findstage.  The mailbox itself 
was given to the append object in append_setup or append_setup_mbox.

stage->parts allows the same file to be copied to multiple separate partitions 
inside their own stage. directory later, which is used for 
single-instance-store when the same message is delivered to hundreds or even 
thousands of users on a single server with potentially many partitions.  Each 
partition has only a single copy of the message which is then hardlinked.

So any crash during the upload stage will leave a file in $spool/stage.  At 
FastMail, we wipe that directory on server start, otherwise we don't touch it.  
That's fine, because crashes are rare.

ASIDE: at the moment, replication puts messages into $spool/sync. instead, with 
a different naming scheme.  Eventually I want to merge the two upload methods 
to create a single location where everything goes, no matter which source - and 
have it parse the message once and then move it to a filename based on the 
GUID, and build the cache structure in memory just once by parsing it.  Yet 
another combining of codepaths eventually :)

Next is append_fromstage, which does the actual work of adding the message to a 
folder.  Both this and append_copy link messages directly into the mailbox 
directories with "$spool/u/user/$uid." filenames.  This is done with 
mailbox_copyfile, which makes hard links if it can, and falls back to a 
streaming write otherwise.
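
The "hard link if it can, streaming write otherwise" behaviour looks roughly like the sketch below.  This is Python rather than the real C, and copyfile_link_or_stream is a name invented here; the only thing it is meant to show is the fallback order.

```python
import os
import shutil


def copyfile_link_or_stream(src, dst):
    """Hard-link src to dst if possible, else fall back to a streamed copy."""
    try:
        os.link(src, dst)
        return "linked"
    except OSError:
        # For example a cross-device link, or a filesystem without hard
        # links.  Fall through to the slow path.
        pass
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())
    return "copied"
```

The hard link is what makes single-instance-store cheap within a partition: the stage file and every mailbox copy share the same data on disk.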

So the copies of the files are always done first, then the cache record is 
created second, either with mailbox_append_cache or during 
mailbox_append_index_record, which will write cache before writing the record 
to cyrus.index.

Finally, append_commit will call mailbox_commit and clean up.

append_abort is supposed to roll back, but at the moment it can't always do so 
cleanly.

Anyway, a failure can cause $uid. files to exist that are "in the future", 
i.e. with UIDs that the index doesn't know about yet.  A reconstruct will 
discover these, and used to always add them to the mailbox.  I think it can 
delete them instead now, since chances are that the message will be delivered 
again in a second anyway.

If you don't run reconstruct, then they will be deleted and replaced when 
another message gets that UID.

Next up, the cache file gets written.  If there's a failure after this, it's 
fine - it just means there is stale data in the cache with nothing pointing to 
it.  Next time the cache file gets rewritten (mailbox_rewrite_index) then those 
bytes won't be copied, and will disappear.

So those are the possible failure modes during append.
* spool file only
* spool file and cyrus.cache entry
* spool file, cyrus.cache entry, cyrus.index record, but the header wasn't 
updated correctly (invalid modseqs, uidnext not updated, messy) - or indeed the 
other way around here, since there's no intermediate fsync - data could be 
written in either order.
* completely correct append, but the client was never told.
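
The list above can be written down as a small decision function.  This is only a restatement of the list in Python; the function and argument names are made up, and nothing like it exists in the Cyrus source as far as this message says.

```python
def classify_append_state(spool_file, cache_entry, index_record,
                          header_updated, client_told):
    """Map what made it to disk onto the failure modes listed above."""
    if not spool_file:
        return "nothing written"
    if not cache_entry:
        return "spool file only"
    if not index_record and not header_updated:
        return "spool file and cyrus.cache entry"
    if index_record != header_updated:
        # No intermediate fsync, so either half can land without the other.
        return "index record and header out of step"
    if not client_told:
        return "complete append, client never told"
    return "complete append"
```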

In the last case, the record exists but the client will try to upload the 
message again - either an IMAP retry or an LMTP delivery retry.

...

So that's arrival.  There's also cleanup.  unlink() is very expensive, so we do 
it outside the lock, during mailbox_index_unlock, which is where most of the 
interesting stuff about updating statuscache, writing data, etc happens.  
During a repack or other unlink event, we keep a list of the messages to 
unlink, and we delete them later.
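
The shape of "collect under the lock, unlink after releasing it" is roughly the following.  MailboxSketch and its fields are invented for this example, and a threading.Lock stands in for the real mailbox file lock.

```python
import os
import threading


class MailboxSketch:
    """Toy mailbox: an in-memory index plus a deferred-unlink list."""

    def __init__(self, directory):
        self.directory = directory
        self.lock = threading.Lock()
        self.index = {}        # uid -> filename on disk
        self._to_unlink = []   # filenames whose index reference is gone

    def expunge(self, uids):
        with self.lock:
            for uid in uids:
                name = self.index.pop(uid, None)
                if name is not None:
                    self._to_unlink.append(name)
            # A real store would write and fsync the index here, before
            # giving up the lock.
            pending, self._to_unlink = self._to_unlink, []
        # Lock released: only now do the slow unlink() calls.
        removed = 0
        for name in pending:
            try:
                os.unlink(os.path.join(self.directory, name))
                removed += 1
            except FileNotFoundError:
                pass
        return removed
```

Other users of the mailbox only ever wait for the cheap index update, never for the filesystem to remove files.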

As above, the file on disk exists until AFTER the reference has been safely 
removed and the result of the removal fsynced to disk.  A crash or even just 
forced shutdown during this can leave messages lying around in the spool.

Again, reconstruct should be able to clean these up.  Another option would be 
to have a two-phase cleanup, which is safer.  Update the record to say "safe to 
unlink" but don't remove the record from cyrus.index, then unlink it, then NEXT 
pass, if the file doesn't exist, it's safe to remove from the cyrus.index.  
(NOTE: if you don't understand about FLAG_EXPUNGED and how it works as a 
tombstone record for QRESYNC, it might be worth grepping around for it and 
seeing how it works - leaving a FLAG_EXPUNGED record in cyrus.index a little 
longer doesn't change what's visible to IMAP, only slows things slightly)
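
Here is a sketch of that two-phase cleanup, assuming a toy index that maps each UID to one of three states.  The state names and cleanup_pass are hypothetical; in real Cyrus the marker would be a flag on the cyrus.index record.

```python
import os


def cleanup_pass(directory, records):
    """Run one cleanup pass over records (uid -> state).

    "expunged"       -> mark "safe_to_unlink", then unlink the file.
    "safe_to_unlink" -> if the file is gone, drop the record; if it is still
                        there (crash last time), unlink and wait for the
                        next pass.
    Returns the UIDs whose records were dropped in this pass.
    """
    dropped = []
    for uid in sorted(records):
        path = os.path.join(directory, "%d." % uid)
        if records[uid] == "expunged":
            records[uid] = "safe_to_unlink"  # a real store fsyncs this first
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass
        elif records[uid] == "safe_to_unlink":
            if os.path.exists(path):
                os.unlink(path)
            else:
                del records[uid]
                dropped.append(uid)
    return dropped
```

A crash at any point leaves either a record with a file or a record without one, never a file without a record, so nothing is left lying around in the spool.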

...

So in terms of hooking a smarter object store in here, we need a way to replace 
mailbox_copyfile with a record-aware copyfile, that can be used both for new 
messages from spool, and for existing messages being linked by append_copy.  
The append_copy case will be either a noop or a refcount operation in a smart 
system, and the mailbox_copyfile one will know which partition to store the 
data on.

I'm actually thinking here that the sensible approach to single-instance store, 
rather than one copy per partition, is to have a list of known partitions 
somewhere mapping name to integer ID, and store the partition ID with EVERY 
SINGLE cyrus.index record.  That way partitions are per-record not per-mailbox, 
and you can easily implement archive partition on top of it :)  OK, that's a 
brand new idea, so I'll let it settle for a bit.
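
For what it's worth, the per-record partition idea might look something like this.  Everything here is invented for illustration (the table contents, the paths, the IndexRecord type); it only shows that the message path is resolved per record rather than per mailbox.

```python
import os
from dataclasses import dataclass

# Hypothetical partition table: name -> integer ID, and ID -> root directory.
PARTITION_IDS = {"default": 0, "archive": 1}
PARTITION_ROOTS = {0: "/var/spool/cyrus/data", 1: "/var/spool/cyrus/archive"}


@dataclass
class IndexRecord:
    uid: int
    partition_id: int  # stored with every single record


def message_path(roots, mailbox_dir, record):
    """Resolve a message file from the record's own partition ID."""
    return os.path.join(roots[record.partition_id], mailbox_dir,
                        "%d." % record.uid)
```

Moving a message to the archive partition is then a copy plus a change of partition_id on one record, with the mailbox itself staying where it is.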

Apologies for rambling.  Hopefully this is useful for code navigation.  Enjoy,

Bron.

-- 
  Bron Gondwana
  br...@fastmail.fm
