(this was originally written for a FastMail internal mailing list, but it has 
more general interest, so I'm posting it here too)

I figured I should write some of this up so we have a design we can all talk 
about, and so we know what's roughly involved in getting there.

*The goal*

We'd love to be able to run no-raid IMAP servers, but that means that if we 
lose a drive, we lose data unless we have a real time replica. There are 
basically two ways to do this:

1) drbd (aka: cross-machine RAID!)
2) synchronous replication at the Cyrus level

We expect drbd performance would be too poor (though this isn't tested), and 
it would make failover tricky because our tooling isn't designed for it, so 
we're looking at synchronous replication in Cyrus.

*Making replication more efficient*

Here's the current network traffic for 3 common scenarios: C is the client (aka 
master) and S is the server (replica).

1) delivery of a new message to mailbox A:

C: GET MAILBOXES (A)
S: * MAILBOX (A) { header details }
S: OK
C: APPLY RESERVE (A) (GUID)
S: OK RESERVED ()
C: APPLY MESSAGE (GUID) {n+}
data
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))
S: OK

2) move of a message from mailbox A to mailbox B

C: GET MAILBOXES (A B)
S: * MAILBOX (A) { header details }
S: * MAILBOX (B) { header details }
S: OK
C: APPLY RESERVE (A B) (GUID)
S: OK RESERVED (GUID)
C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))
S: OK

3) touch a flag on a record (e.g. mark it seen as the owner - seen as non-owner 
is more complex and involves an APPLY META)

C: GET MAILBOXES (A)
S: * MAILBOX (A) { header details }
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))
S: OK

In order to speed all these up, we have proposed a replication cache. This 
would know, for each mailbox, the last known state at the replica end (aka: 
the GET MAILBOXES response).

so:

1) becomes
C: APPLY MESSAGE (GUID) {n+}
data
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))
S: OK

2) becomes
C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))
S: OK

and 3) becomes
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))
S: OK

What happened to the "APPLY RESERVE" in (2), you ask? We do an on-the-fly 
reserve on the server by scanning through any new appends and looking up each 
record's GUID in the conversations DB. Yep. And the client knows the GUID is 
likely there by checking createdmodseqs on existing records in its own 
conversations DB, because we can do that.
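
For the client-side check, roughly what I have in mind is something like this 
(a sketch only: conv_guid_oldest_createdmodseq() is a made-up helper name, and 
the real lookup would walk our conversations DB much like the JMAP code does):

  /* types come from the usual Cyrus headers (conversations.h etc.) */
  /* Sketch: decide whether the replica probably already has a copy of
   * this GUID, so we can skip APPLY RESERVE and let the sync_server
   * find the file itself from its own conversations DB. */
  static int guid_probably_on_replica(struct conversations_state *cstate,
                                      const struct message_guid *guid,
                                      modseq_t last_replicated_modseq)
  {
      /* hypothetical helper: the oldest createdmodseq of any existing
       * record with this GUID in our local conversations DB */
      modseq_t created = conv_guid_oldest_createdmodseq(cstate, guid);

      /* if a copy existed at a modseq the replica has already seen,
       * it should be able to reserve the file locally */
      return created && created <= last_replicated_modseq;
  }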

*Sanity checking*

If we're just going to do a direct APPLY command against a cached state, then 
the existing race window between the GET MAILBOXES and the APPLY MAILBOX 
becomes much wider. This is fine if nothing else ever modifies your replica, 
but sometimes things change. I already have a branch to address this with an 
ifInState equivalent for the replication protocol: three new keys in the APPLY 
MAILBOX (SINCE_MODSEQ, SINCE_CRC and SINCE_CRC_ANNOT) plus a new error code. 
The branch is done and passes tests.
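
To illustrate, the guarded version of (3) would look something like this (the 
dlist layout here is approximate, and the exact error response is whatever the 
branch defines):

C: APPLY MAILBOX A { SINCE_MODSEQ 1234 SINCE_CRC ... SINCE_CRC_ANNOT ...
   header details } RECORD (%(UID GUID FLAGS ...))
S: OK

If something else has touched the replica since our cached state, the replica 
instead replies with the new error code, and the client would presumably fall 
back to a full GET MAILBOXES and retry with fresh state.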

*The cache*

This would probably be a per-channel twoskip file on tmpfs, storing one dlist 
per mailbox.
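
Concretely, that's one record per mailbox, keyed by name, holding roughly the 
same dlist the replica would send in its GET MAILBOXES response (the mailbox 
name and field names here are illustrative):

key:   user.brong
value: %(UNIQUEID abc123 LAST_UID 42 HIGHESTMODSEQ 1234 SYNC_CRC ... )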

The place to hook writing this would be after the OK response, just before 
'done:' in update_mailbox_once(). We would also hook wiping it into any 
failure of update_mailbox_once(), so that the retries can fix things up.

The place to hook reading would probably be sync_do_mailboxes, where the loop 
copying the sync_folder_list data into the MAILBOXES dlist could check the 
cache for each name, transcribe any cached state straight into the 
replica_folders list, and obviously not send a backend request at all if the 
entire list was satisfied from cache. We could also write the cache with the 
result of the MAILBOXES call, just in case everything is already up to date 
and update_mailbox_once doesn't need to be called!
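
A rough sketch of the read side (replica_cache_get() and 
sync_folder_list_add_cached() are invented names, and the real 
sync_folder_list plumbing carries more fields than shown):

  /* inside the sync_do_mailboxes loop over the logged mailbox names;
   * replica_cache_get() and sync_folder_list_add_cached() are hypothetical */
  struct dlist *cached = replica_cache_get(channel, mboxname);
  if (cached) {
      /* treat the cached dlist as if the replica had just answered
       * GET MAILBOXES for this name */
      sync_folder_list_add_cached(replica_folders, cached);
      continue;
  }
  /* not cached: still need to ask the replica about this one */
  dlist_setatom(kl, "MBOXNAME", mboxname);
  n_uncached++;

  /* ...then only send GET MAILBOXES at all if n_uncached > 0, and write
   * the replica's answers back into the cache when they arrive */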

*The new reserve*

Skipping the current reserve call requires making the sync_server able to 
resolve GUIDs as needed from its local conversations DB and copy the files 
over itself. This should be viable using logic from the JMAP bulk-update, but 
it will need to be copied, because the JMAP code relies on lots of logic 
buried deep inside the jmap modules, so we can't just call it from the 
replication subsystem.

This saves roundtrips for the reserve call. It does depend on delayed expunge 
to some extent, but that's OK so long as it can recover from a "missing GUID 
on the replica" failure when its estimate of what was in other folders turns 
out to be wrong!
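
On the server end the shape is roughly this (modelled on what the JMAP 
bulk-update does; conversations_guid_foreach() is the existing iterator as I 
remember it, and everything else here is named just for the sketch):

  /* Sketch: an incoming RECORD names a GUID we didn't reserve up front.
   * Ask our own conversations DB which (mailbox, uid) pairs hold that
   * GUID, and have the callback try to link/copy the spool file from the
   * first copy that still exists (delayed expunge makes that likely). */
  static int reserve_from_conversations(struct conversations_state *cstate,
                                        const char *guidrep,
                                        struct sync_msgid_list *part_list)
  {
      /* try_copy_file_cb (hypothetical) stages the file into the reserve
       * area and returns non-zero to stop the walk once it has a copy */
      return conversations_guid_foreach(cstate, guidrep,
                                        try_copy_file_cb, part_list);
  }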

*A local lock*

While any sync_client is replicating a mailbox, it will need to take a local 
lock on $SYNC-$channel-$mboxname! This is so that we don't have two processes 
trying to sync the same mailbox on the same channel at the same time! Rolling 
sync could try a non-blocking lock on that name and just punt it (aka: 
sync_log it to itself again) if the lock is busy.
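
To make it concrete, the lock could be as simple as this (shown with a plain 
flock()'d lock file for illustration; the real thing would more likely go 
through Cyrus's existing lock helpers):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/file.h>
  #include <unistd.h>

  /* Sketch: take the per-channel, per-mailbox sync lock. nonblock is the
   * rolling-sync case: if the lock is busy it gives up straight away and
   * the caller just sync_logs the name to itself again. */
  static int sync_local_lock(const char *lockdir, const char *channel,
                             const char *mboxname, int nonblock)
  {
      char path[4096];
      snprintf(path, sizeof(path), "%s/$SYNC-%s-%s",
               lockdir, channel, mboxname);

      int fd = open(path, O_CREAT | O_RDWR, 0600);
      if (fd < 0) return -1;

      if (flock(fd, LOCK_EX | (nonblock ? LOCK_NB : 0)) < 0) {
          close(fd);
          return -1;   /* busy or error: caller punts or retries later */
      }
      return fd;       /* caller close()s the fd to release the lock */
  }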

*The realtime sync*

And now we get to the meat of this! We want to allow multiple parallel 
real-time sync calls, but probably not hundreds. I suggest we use a 
sync_client-style tool which runs as a daemon within Cyrus, probably listening 
on a UNIX socket and keeping a long-running backend connection open, so there 
may be a handful of syncd / sync_server pairings running at any one time, 
servicing random requests.

The config would be something like

sync_realtime_socket: [% confdir %]/socket/syncd

and in cyrus.conf

 syncd cmd="syncd [% CONF %] -n [% buddyslotchannel %]"
       listen="[% confdir %]/socket/syncd"

(aka: use the named channel to choose the backend for realtime sync)

The client code would keep EXACTLY the current sync_log support to all 
channels, but with one very small modification added: at the end of 
mailbox_unlock_index or mailbox_close or wherever seems the most sane, we add 
a synchronous callout. It might even go up a layer, though there are many more 
places to think about. Anyway, this callout connects to the local syncd on the 
configured socket, asks it to replicate just the one mailbox name to the 
backend, and waits for the reply. If the reply is a failure, then that gets 
syslogged, but otherwise the return is still success to the rest of the code.
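
As a minimal sketch of that callout (plain UNIX-socket code; the 
one-name-in, OK-or-NO-out wire format is just an assumption here, as is the 
exact hook point):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <syslog.h>
  #include <unistd.h>

  /* Sketch: block until the local syncd says it has replicated this one
   * mailbox. A failure is syslogged but not propagated here (whether it
   * should also fail the client request is a policy question). */
  static void sync_realtime_callout(const char *sockpath, const char *mboxname)
  {
      struct sockaddr_un sun = { .sun_family = AF_UNIX };
      strncpy(sun.sun_path, sockpath, sizeof(sun.sun_path) - 1);

      char reply[64] = "";
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0 ||
          connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0 ||
          write(fd, mboxname, strlen(mboxname)) < 0 ||
          write(fd, "\r\n", 2) < 0 ||
          read(fd, reply, sizeof(reply) - 1) < 0 ||
          strncmp(reply, "OK", 2) != 0) {
          syslog(LOG_ERR, "realtime sync to buddy failed for %s", mboxname);
      }
      if (fd >= 0) close(fd);
  }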

The end result of hooking at this point is that EVERY protocol which replies 
success after committing the changes will not reply back to the client until 
the data is replicated to the buddy machine (buddy naming is hearkening back to 
the short-lived career of Max at FastMail and his buddyslot idea!)

The rolling sync_client will still pick up the logged name, but the good news 
is that either the mailbox will still be locked (the local lock) and it will 
try again later, or it will already be up to date and the local state will 
match the cached state, so there's no network IO or heavy calculation at all!

This complete lack of network IO in the follow-up case is why I think the 
cache and the remote conversations-based GUID lookup (plus the local 
validation that it's worth trying) are worth doing first, because that means 
synchronous replication stands a chance of being efficient enough to be sane!

I _think_ this design is good. Please let me know if anything about how it's 
supposed to work is unclear, or if you have a better idea.

In terms of the split of work: I envisage Ken or ellie writing the syncd 
stuff. I've already written the ifInState stuff, so I'll just ship that. The 
caching could be anybody (it's pretty easy), and the GUID lookup for 
master/replica could be me, or someone else if they're keen to have a look at 
it - I have a very clear idea of how that would look.

Cheers,

Bron.

--
 Bron Gondwana, CEO, FastMail Pty Ltd
 br...@fastmailteam.com
