Re: Future plans for replication

Julien Coloos Fri, 12 Aug 2011 08:59:14 -0700

Le 12/08/2011 13:37, Bron Gondwana a écrit :

On Fri, 12 Aug 2011 11:16:30 +0200, Julien Coloos<julien.col...@atos.net> wrote:
1) Master-master replication
There are many difficulties to overcome for this feature, as well asmany ways to achieve it:
   - 2-servers only, multi-master, ... ?
N servers - mesh, chain, whatever.  Ideally any layout will work.

My plan (which is partially done) is named channels for sync - basically
the name of the server you're connecting to.  When you make a change,
it injects a record into the sync log for each channel you haveconfigured.
The sync_client for that channel then runs a reconciliation of every
listed item.

Nice, no wonder I though that this channel stuff I saw in the sourcecode could be used to separate logs for different servers, since thatwas your intent :)And proceeding that way allows to create mesh/chain topologies dependingon the configuration.

If sync_server makes a change, is should suppress the sync_log for the
channel on which the change came in - so it doesn't try to replicate
back.  But it doesn't hurt too badly, because the data will be in sync
anyway, in theory...

Yeah, thinking about it it's not easy to prevent unnecessaryreconciliations when topologies get complex. Fortunately propagationbetween servers quickly ends once they see they are in sync. And peoplewho do care could use a chain topology to limit this.

   - optimistic replication ?


There are some things that I want to add to this, in particular a
"replication cache" which remembers the state of the mailbox at
last sync, allowing a sync_client to optimistically generate a
"change request" to the sync_server, listing all the changes that
would bring the target mailbox up-to-date, and also listing the
current state it EXPECTS on the server.  That way we can check
up-front if anything on the replica side has already changed, and
fall back to a full replication run.

The "full replication run" would first reconcile everything past the
last known "in sync" modseq, and only if that failed would it
send the entire mailbox contents.

Would that cache be per-server, or shared on the platform (kinda likethe MUPDATE server) ?

- since most meta-data do not (yet ?) have some kind of revisioncontrol, how to reconciliate when concurrent changes happen ?
Add revision control to everything.  In particular:

a) modseq to all annotations
b) keep DELETED entries in mailboxes.db with the UIDVALIDITY of
   the deleted mailbox (also on moves)
c) version numbers and "deleted markers" for all sieve scripts


Fair enough.

b is indeed useful to determine whether the mailbox was created onreplica while master was down (needs to be copied back), or if it wasdeleted on master and was not yet pushed to replica.A little bit curious about those "deleted markers" :) What do you havein mind ?

-> actually, should data be allowed to diverge naturally ? thatis, are users allowed to hit any master for the mailbox, or shouldthey be directed to *the* master reference of the mailbox (othermasters being hit only when the reference is down)
   - ...
They should be directed to the primary copy - IMAP and eventual
consistency are not friends.  You can make it work with UID promotion
logic, but flag merging will always be an inexact science unless we
keep MODSEQ and LASTMODIFIED separately for every single flag, and
the UID promotion still sucks when it happens to you.

I also think that this would be easier to handle than trying to manageconcurrent modifications on all replicas as the nominal case (thinkingabout it already give me headaches).Guess MODSEQ and LASTMODIFIED for every flag would be a little overkillto achieve better merging.

Talking about UID promotion (great idea by the way :)), we wonder ifthere could be some gain using a centralised service (ala MUPDATE; hencemy previous question about the "replication cache") which would be usedto store and give the next UID of mailboxes. This could prevent UIDscollisions, replacing UID promotion need in general, and should haveless impact on IMAP clients with agressive caching.

2) Dedicated (list of) replication server(s) per mailbox
For example there is the default server-wide replicationconfiguration, but a new per-mailbox annotation could override thatglobal setting.One of the advantages of this feature is to allow to put the load onmore than one server (which is the replica in current situation) whena master is down.
I want to do this in mailboxes.db.  It's a lot of work, but extending the
murder syntax to allow multiple "locations" to be specified for a mailbox
including listing the preferred primary would allow all this to fit in
naturally with murder. And it would ROCK for XFER, because you wouldjust
add the mailbox to a new server, replicate it, set the other end as the
preferred master, and then once all clients had finished their business
and been moved over, you could clean up the source server by removing the
mailbox from its list.

This is kind of the holy-grail of replication.

Eh, didn't think about murder. Then yes it would be more logical to haveit in mailboxes.db.

3) Handling messages partition that actually do not need replication
For example:
- meta-data files are stored on local partitions, for fasteraccess, and do need to be replicated- messages are stored on partitions shared with other masters anddo not need to be replicated; think NFS access (or similar), withdata already secured (e.g. RAID or proper filesystem replication)That's a peculiar scenario, and we know that feature may be a littletricky (even more if trying to manage it in a generic way, takinginto account each kind of data/meta-data). But maybe it would benefitto people other than us ?
Goodness me.  I hadn't even thought of this.  Yes, it could be done.
Actually the easiest way would be to check if the message already
exists in the spool before replicating the data contents across.

Yes, and maybe also when receiving a new message on the replica whilethe master is down: if a file corresponding to the UID is already there,we could assume it is actually already used and will come later from theconcerned server (unless UIDs are given by a central service). Well,that would certainly be harder than that to handle, but those arealready a few pointers.

I can see the use-case.  Very interesting.  Mmm.  It should be fairly
easy to add a flag (possibly per partition/channel pair) for this.
In our case such a scenario would actually make sense, notablyconsidering the number of mailboxes (tens of millions).Different people have different needs for their architecture, and themore possibilities offered to cyrus users, the better. But noteverything can be implemented, and we guess choices will have to bemade.
Comments and inputs are welcomed :)
Let me know you need anything clarified in what I wrote there - that's
kind of "off the cuff", but it's stuff I've been thinking about for a
long time.

Bron.


That's indeed really well though :) Thanks for your answers.

And as explained in the initial mail we would be happy to help you wherepossible, since we do have similar goals here.



Regards
Julien

Re: Future plans for replication

Reply via email to