(this was originally written for a FastMail internal mailing list, but it has
more general interest, so I'm posting it here too)
I figured I should write some of this up so we have a design we can all talk
about, and so we know what's roughly involved in getting there.
**The goal**
We'd love to be able to run no-RAID IMAP servers, but that means that if we
lose a drive, we lose data unless we have a real-time replica. There are
basically two ways to do this:
1) drbd (aka: cross-machine RAID!)
2) synchronous replication at the Cyrus level
We expect drbd performance would be too poor (though this isn't tested), and
it would mean tricky things for failover because our tooling isn't designed
for it, so we're looking at synchronous replication in Cyrus.
**Making replication more efficient**
Here's the current network traffic for 3 common scenarios: C is the client (aka
master) and S is the server (replica).
1) delivery of a new message to mailbox A:
C: GET MAILBOXES (A)
S: * MAILBOX (A) { header details }
S: OK
C: APPLY RESERVE (A) (GUID)
S: OK RESERVED ()
C: APPLY MESSAGE (GUID) {n+}
data
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))
S: OK
2) move of a message from mailbox A to mailbox B
C: GET MAILBOXES (A B)
S: * MAILBOX (A) { header details }
S: * MAILBOX (B) { header details }
S: OK
C: APPLY RESERVE (A B) (GUID)
S: OK RESERVED (GUID)
C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))
S: OK
3) touch a flag on a record (e.g. mark it seen as the owner - seen as non-owner
is more complex and involves an APPLY META)
C: GET MAILBOXES (A)
S: * MAILBOX (A) { header details }
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))
S: OK
In order to speed all these up, we have proposed a replication cache. This
would know, for each mailbox, the last known state at the replica end (aka:
the GET MAILBOXES response). So:
1) becomes
C: APPLY MESSAGE (GUID) {n+}
data
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))
S: OK
2) becomes
C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))
S: OK
and 3) becomes
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))
S: OK
What happened to the "APPLY RESERVE" in (2), you ask? We do an on-the-fly
reserve on the server by scanning through any new appends and looking up the
record via the conversations DB. Yep. And the client knows it's likely already
there by checking createdmodseqs on existing records in ITS conversations DB,
because we can do that.
**Sanity checking**
If we're going to do a direct APPLY command from a cached state, then the
existing race window between the GET MAILBOXES and the APPLY MAILBOX becomes
much wider. This is fine if nothing else ever modifies your
replica, but sometimes things change. I already have a branch to address this
with the ifInState equivalent for the replication protocol, which is three new
keys in the APPLY MAILBOX: SINCE_MODSEQ, SINCE_CRC and SINCE_CRC_ANNOT, plus a
new error code. This branch is already done and passes tests.
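To make that concrete, here's roughly what such an exchange might look like
(the values and the wording of the error are made up for illustration; the
real key names are SINCE_MODSEQ, SINCE_CRC and SINCE_CRC_ANNOT as above):
C: APPLY MAILBOX A { header details SINCE_MODSEQ 41226 SINCE_CRC 3987172029 SINCE_CRC_ANNOT 0 } RECORD (%(UID GUID FLAGS...))
S: NO [the new error code] mailbox A no longer matches that state
On that NO the client would presumably throw away its cached state for A and
fall back to the existing GET MAILBOXES path, exactly like a cache miss.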
**The cache**
This would probably be a per-channel twoskip file on tmpfs, into which a
dlist per mailbox would be stored.
The place to hook writing this would be after the OK response, just before
'done:' in update_mailbox_once(). We would also hook wiping it into any
failure of update_mailbox_once() so that the retries could fix things up.
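As a very rough sketch of the shape (the helper names here are hypothetical
stand-ins, not the real twoskip/dlist calls):

#include <stdlib.h>

/* hypothetical stand-ins for the real twoskip/dlist plumbing */
int sync_cache_store(const char *channel, const char *mboxname,
                     const char *replica_state);
int sync_cache_wipe(const char *channel, const char *mboxname);
/* stand-in for the existing APPLY MAILBOX round trip; on success it
 * hands back the state the replica just acknowledged */
int do_apply_mailbox(const char *channel, const char *mboxname,
                     char **replica_state);

static int update_mailbox_once_sketch(const char *channel, const char *mboxname)
{
    char *state = NULL;
    int r = do_apply_mailbox(channel, mboxname, &state);

    if (r) {
        /* any failure: forget the cached state so the retry does a real
         * GET MAILBOXES and fixes things up */
        sync_cache_wipe(channel, mboxname);
    }
    else {
        /* got the OK: this is now the known state at the replica end */
        sync_cache_store(channel, mboxname, state);
    }

    free(state);
    return r;
}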
The place to hook reading would probably just be sync_do_mailboxes, where the
loop copying the sync_folder_list data into the MAILBOXES dlist could check
the cache for each name and transcribe that straight into the replica_folders
list, and obviously not send a backend request at all if the entire list was
satisfied from cache. We could also write the cache with the result of the
MAILBOXES call, just in case it's already up to date and hence
update_mailbox_once() doesn't need to be called!
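And the read side, again with a hypothetical sync_cache_fetch() standing in
for the twoskip lookup - the point is just that only cache misses end up in
the GET MAILBOXES request:

#include <stddef.h>
#include <stdlib.h>

/* hypothetical: returns the cached MAILBOX state for a name, or NULL */
char *sync_cache_fetch(const char *channel, const char *mboxname);

/* roughly what the loop in sync_do_mailboxes would do: cached names go
 * straight into replica_folders (not shown), misses go to the backend */
static size_t collect_cache_misses(const char *channel,
                                   const char **names, size_t nnames,
                                   const char **misses)
{
    size_t nmisses = 0;
    for (size_t i = 0; i < nnames; i++) {
        char *cached = sync_cache_fetch(channel, names[i]);
        if (cached) {
            /* transcribe the cached state into the replica_folders list here */
            free(cached);
        }
        else {
            misses[nmisses++] = names[i];
        }
    }
    /* if nmisses is zero, skip the backend GET MAILBOXES entirely */
    return nmisses;
}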
**The new reserve**
Skipping the current reserve call requires making the sync_server able to
resolve GUIDs as needed from its local conversations.db and copy the files
over. This should be viable using logic from the JMAP bulk-update, but it's
going to need to be copied, because the JMAP code depends on lots of logic
which is deep inside the jmap modules, so we can't just call it from the
replication subsystem.
This saves roundtrips for the reserve call. It does depend on delayed expunge
to some level, but that's OK so long as it can recover from a "missing GUID on
the replica" failure because its estimate of what was in other folders was
wrong!
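A sketch of the server-side flow under those assumptions - all of the helper
names here are hypothetical, and the real version would lean on the
conversations DB code and whatever gets lifted out of the JMAP bulk-update:

#include <stdint.h>
#include <stdlib.h>

/* hypothetical: find some folder/uid on this replica that already holds
 * the message with this GUID, via the local conversations.db */
int conv_find_guid_instance(const char *guid, char **folder, uint32_t *uid);
/* hypothetical: hard-link or copy that spool file into the target folder */
int link_or_copy_spool_file(const char *srcfolder, uint32_t srcuid,
                            const char *dstfolder);

/* called when an APPLY MAILBOX record arrives whose GUID we were never
 * sent a file for */
static int resolve_missing_guid(const char *guid, const char *dstfolder)
{
    char *srcfolder = NULL;
    uint32_t srcuid = 0;

    if (conv_find_guid_instance(guid, &srcfolder, &srcuid)) {
        /* not here (e.g. already expunged and cleaned up): return the
         * "missing GUID on the replica" failure so the client can fall
         * back to uploading the file the normal way */
        return -1;
    }

    int r = link_or_copy_spool_file(srcfolder, srcuid, dstfolder);
    free(srcfolder);
    return r;
}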
**A local lock**
While any sync_client is replicating a mailbox, it will need to take a local
lock on $SYNC-$channel-$mboxname! This is so that we don't have two processes
trying to sync the same mailbox on the same channel at the same time! Rolling
sync could try a non-blocking lock on that name and just punt it (aka:
sync_log it to itself again) if the lock is busy.
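A minimal sketch of the lock itself, assuming a plain flock()ed lock file per
channel and mailbox (the real thing would presumably use the existing Cyrus
lock helpers, and a real mboxname would need escaping before being used in a
path):

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Take the per-channel, per-mailbox sync lock.  Returns an open fd on
 * success; -1 means someone else holds it (rolling sync would then just
 * sync_log the name to itself again and move on). */
static int sync_local_lock(const char *lockdir, const char *channel,
                           const char *mboxname, int nonblock)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/$SYNC-%s-%s", lockdir, channel, mboxname);

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;

    if (flock(fd, LOCK_EX | (nonblock ? LOCK_NB : 0)) < 0) {
        close(fd);
        return -1;  /* busy: punt and try later */
    }
    return fd;      /* hold until close(fd) */
}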
**The realtime sync**
And now we get to the meat of this! We want to allow multiple parallel
real-time sync calls, but probably not hundreds. I suggest that we use a
sync_client-style tool which runs as a daemon within Cyrus, probably listening
on a UNIX socket and keeping a long-running backend connection open, so there
may be a handful of syncd / sync_server pairings running at any one time,
servicing random requests.
The config would be something like
sync_realtime_socket: [% confdir %]/socket/syncd
and in cyrus.conf:
syncd cmd="syncd [% CONF %] -n [% buddyslotchannel %]" listen="[% confdir %]/socket/syncd"
(aka: use the named channel to choose the backend for realtime sync)
The client code would keep EXACTLY the current sync_log support to all
channels, but have the following very small modification added: at the end of
mailbox_unlock_index or mailbox_close or wherever seems the most sane, we add
a synchronous callout. It might even go up a layer, though there are many more
places to think about. Anyway, this callout connects to the local syncd on its
configured socket and asks it to replicate just the one mailbox name to the
backend, and waits for the reply. If the reply is a failure, that gets
syslogged, but we still return success to the rest of the code.
The end result of hooking at this point is that EVERY protocol which replies
success after committing the changes will not reply back to the client until
the data is replicated to the buddy machine (the buddy naming hearkens back to
the short-lived career of Max at FastMail and his buddyslot idea!)
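For concreteness, a sketch of the callout itself, assuming a trivial
one-line-request, one-line-reply protocol over the UNIX socket (the actual
protocol between the callout and syncd isn't designed yet, so treat the wire
format and names as illustrative):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <syslog.h>
#include <unistd.h>

static int sync_realtime_callout(const char *sockpath, const char *mboxname)
{
    struct sockaddr_un addr;
    char reply[1024] = "";
    ssize_t n = -1;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd >= 0) {
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, sockpath, sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            /* hypothetical wire format: mailbox name in, one OK/NO line back */
            dprintf(fd, "%s\n", mboxname);
            n = read(fd, reply, sizeof(reply) - 1);
        }
        close(fd);
    }

    if (n <= 0 || strncmp(reply, "OK", 2) != 0)
        syslog(LOG_ERR, "realtime sync of %s failed: %s", mboxname,
               n > 0 ? reply : "no reply from syncd");

    /* a replication failure is syslogged but never becomes a
     * client-visible error; the rolling sync will catch up later */
    return 0;
}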
The rolling sync_client will still pick up the logged name, but the good news
is that it will either still be locked (the local lock) and hence try again
later, or it will already be up to date and the local state will match the
cached state, so there's no network IO or heavy calculation at all!
This lack of any network IO for the follow-up case is why I think the cache
and the remote conversations-based GUID lookup (and the local validation that
it's worth trying) are worth doing first, because that means synchronous
replication stands a chance of being efficient enough to be sane!
I _think_ this design is good. Please let me know if anything about how it's
supposed to work is unclear, or if you have a better idea.
In terms of the split of work: I envisage Ken or ellie writing the syncd
stuff. I've already written the ifInState stuff, so I'll just ship that. The
caching could be anybody (it's pretty easy), and the GUID lookup for
master/replica could be me, or could be someone else if they're keen to have a
look at it; I have a very clear idea of how that would look.
Cheers,
Bron.
--
Bron Gondwana, CEO, FastMail Pty Ltd
[email protected]