On 12/08/2011 13:37, Bron Gondwana wrote:
On Fri, 12 Aug 2011 11:16:30 +0200, Julien Coloos <julien.col...@atos.net> wrote:

1) Master-master replication
There are many difficulties to overcome for this feature, as well as many ways to achieve it:
   - 2-servers only, multi-master, ... ?

N servers - mesh, chain, whatever.  Ideally any layout will work.

My plan (which is partially done) is named channels for sync - basically
the name of the server you're connecting to.  When you make a change,
it injects a record into the sync log for each channel you have configured.
The sync_client for that channel then runs a reconciliation of every
listed item.


Nice, no wonder I thought the channel mechanism I saw in the source code could be used to separate logs for different servers, since that was your intent :) And proceeding that way allows mesh/chain topologies to be created simply through configuration.
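The per-channel sync log described above could be sketched as follows (a toy model with hypothetical names; the real Cyrus sync_log is file-based and its format differs):

```python
# Toy model of per-channel sync logging: each configured channel (named
# after the peer server) keeps its own log, and every local change is
# recorded in all of them. A sync_client per channel later drains its
# log and reconciles the listed items. Illustrative names only.

class SyncLog:
    def __init__(self, channels):
        # one append-only event list per replication channel
        self.logs = {name: [] for name in channels}

    def record_change(self, mailbox, event):
        # a local change is injected into every channel's log
        for log in self.logs.values():
            log.append((mailbox, event))

    def drain(self, channel):
        # the sync_client for `channel` consumes its pending items
        items, self.logs[channel] = self.logs[channel], []
        return items

log = SyncLog(["replica1", "replica2"])
log.record_change("user.julien", "APPEND")
```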

If sync_server makes a change, it should suppress the sync_log for the
channel on which the change came in - so it doesn't try to replicate
back.  But it doesn't hurt too badly, because the data will be in sync
anyway, in theory...


Yeah, thinking about it, it's not easy to prevent unnecessary reconciliations when topologies get complex. Fortunately, propagation between servers ends quickly once they see they are in sync. And people who do care could use a chain topology to limit this.
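The suppression Bron describes could be sketched like this (hypothetical names; the change's origin channel is simply skipped when re-logging, so it propagates onward through a mesh or chain without echoing straight back):

```python
# Sketch of loop suppression: when a change is applied by the
# sync_server, the channel it arrived on is excluded from re-logging,
# so the change is forwarded to the other peers but not echoed back to
# its origin. Illustrative only, not the real Cyrus implementation.

def record_change(logs, mailbox, event, origin=None):
    for channel, log in logs.items():
        if channel == origin:
            continue  # don't replicate back to where the change came from
        log.append((mailbox, event))

logs = {"serverA": [], "serverB": []}
# change received from serverA: forward only to serverB
record_change(logs, "user.julien", "MAILBOX", origin="serverA")
```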

   - optimistic replication ?

There are some things that I want to add to this, in particular a
"replication cache" which remembers the state of the mailbox at
last sync, allowing a sync_client to optimistically generate a
"change request" to the sync_server, listing all the changes that
would bring the target mailbox up-to-date, and also listing the
current state it EXPECTS on the server.  That way we can check
up-front if anything on the replica side has already changed, and
fall back to a full replication run.

The "full replication run" would first reconcile everything past the
last known "in sync" modseq, and only if that failed would it
send the entire mailbox contents.


Would that cache be per-server, or shared on the platform (kinda like the MUPDATE server) ?
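Either way, the optimistic path could look roughly like this (a sketch assuming the cached state is compared wholesale; all names are made up):

```python
# Sketch of the "replication cache" idea: the sync_client remembers the
# state it believes the replica holds, sends a change request carrying
# that expected state, and the server refuses it if the replica has
# diverged - forcing the caller back to a full replication run.

def apply_change_request(replica_state, expected_state, changes):
    # optimistic path: only apply if the replica is exactly as expected
    if replica_state != expected_state:
        return None  # caller falls back to full reconciliation
    new_state = dict(replica_state)
    new_state.update(changes)
    return new_state

cached = {"uidnext": 10, "modseq": 42}
ok = apply_change_request({"uidnext": 10, "modseq": 42}, cached,
                          {"uidnext": 11, "modseq": 43})
diverged = apply_change_request({"uidnext": 12, "modseq": 50}, cached,
                                {"uidnext": 11, "modseq": 43})
```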

- since most meta-data do not (yet ?) have some kind of revision control, how do we reconcile when concurrent changes happen ?

Add revision control to everything.  In particular:

a) modseq to all annotations
b) keep DELETED entries in mailboxes.db with the UIDVALIDITY of
   the deleted mailbox (also on moves)
c) version numbers and "deleted markers" for all sieve scripts


Fair enough.
b) is indeed useful to determine whether the mailbox was created on the replica while the master was down (and needs to be copied back), or was deleted on the master and the deletion not yet pushed to the replica. A little curious about those "deleted markers" :) What do you have in mind ?
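For b), the tombstone comparison could be sketched as follows (hypothetical data model; the key idea is that a DELETED entry keeps the UIDVALIDITY of the instance that was deleted):

```python
# Sketch of reconciliation with tombstones: comparing UIDVALIDITY lets
# a sync decide between "this very mailbox was deleted on the master"
# and "a new mailbox of the same name was created on the replica while
# the master was down". Illustrative only.

def reconcile_mailbox(master, replica):
    # each side is None, a live entry {"uidvalidity": N}, or a
    # tombstone {"uidvalidity": N, "deleted": True}
    if master is None and replica is not None:
        return "copy-to-master"        # unknown to master: created on replica
    if master and master.get("deleted"):
        if replica and replica["uidvalidity"] == master["uidvalidity"]:
            return "delete-on-replica" # same instance: the deletion wins
        if replica:
            return "copy-to-master"    # a *new* instance exists on replica
        return "noop"
    return "push-to-replica"

tombstone = {"uidvalidity": 100, "deleted": True}
```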

-> actually, should data be allowed to diverge naturally ? That is, are users allowed to hit any master for the mailbox, or should they be directed to *the* reference master of the mailbox (other masters being hit only when the reference is down) ?
   - ...

They should be directed to the primary copy - IMAP and eventual
consistency are not friends.  You can make it work with UID promotion
logic, but flag merging will always be an inexact science unless we
keep MODSEQ and LASTMODIFIED separately for every single flag, and
the UID promotion still sucks when it happens to you.


I also think this would be easier to handle than trying to manage concurrent modifications on all replicas as the nominal case (just thinking about it already gives me headaches). I guess keeping MODSEQ and LASTMODIFIED for every flag would be a little overkill just to achieve better merging.

Talking about UID promotion (great idea by the way :)), we wonder whether there could be some gain in using a centralised service (à la MUPDATE; hence my previous question about the "replication cache") to store and hand out the next UID of each mailbox. This would prevent UID collisions, replacing the need for UID promotion in general, and should have less impact on IMAP clients with aggressive caching.
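Such a central allocator, if it were built, might look like this in miniature (hypothetical; a local lock stands in for the real network protocol the MUPDATE-like service would need):

```python
# Sketch of a central UID allocation service: every server asks the
# service for the next UID of a mailbox, so two masters can never hand
# out the same UID and promotion becomes unnecessary in the common
# case. Hypothetical names; not an existing Cyrus component.

import threading

class UidService:
    def __init__(self):
        self._next = {}
        self._lock = threading.Lock()

    def next_uid(self, mailbox):
        # atomic allocate-and-increment per mailbox
        with self._lock:
            uid = self._next.get(mailbox, 1)
            self._next[mailbox] = uid + 1
            return uid

svc = UidService()
```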

2) Dedicated (list of) replication server(s) per mailbox
For example there is the default server-wide replication configuration, but a new per-mailbox annotation could override that global setting. One advantage of this feature is that it allows the load to be spread over more than one server (instead of just the replica, as in the current situation) when a master is down.

I want to do this in mailboxes.db.  It's a lot of work, but extending the
murder syntax to allow multiple "locations" to be specified for a mailbox
including listing the preferred primary would allow all this to fit in
naturally with murder. And it would ROCK for XFER, because you would just
add the mailbox to a new server, replicate it, set the other end as the
preferred master, and then once all clients had finished their business
and been moved over, you could clean up the source server by removing the
mailbox from its list.

This is kind of the holy-grail of replication.


Eh, I hadn't thought about murder. Then yes, it would be more logical to have it in mailboxes.db.
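The XFER flow Bron outlines could be sketched against such a multi-location mailboxes.db entry (illustrative data model only; step 2, the actual replication, is elided):

```python
# Sketch of XFER with multi-location mailbox entries: a mailbox lists
# several locations plus a preferred primary, and a transfer is just
# add / replicate / flip primary / remove. Hypothetical representation.

entry = {"locations": ["serverA"], "primary": "serverA"}

def xfer(entry, new_server):
    old = entry["primary"]
    entry["locations"].append(new_server)  # 1. add mailbox to new server
    # 2. replicate the mailbox to new_server, then:
    entry["primary"] = new_server          # 3. flip the preferred primary
    # 4. once clients have moved over, drop the old location
    entry["locations"].remove(old)
    return entry

moved = xfer(entry, "serverB")
```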

3) Handling message partitions that actually do not need replication
For example:
   - meta-data files are stored on local partitions, for faster access, and do need to be replicated
   - messages are stored on partitions shared with other masters and do not need to be replicated; think NFS access (or similar), with data already secured (e.g. RAID or proper filesystem replication)
That's a peculiar scenario, and we know that feature may be a little tricky (even more so if trying to manage it in a generic way, taking into account each kind of data/meta-data). But maybe it would benefit people other than us ?

Goodness me.  I hadn't even thought of this.  Yes, it could be done.
Actually the easiest way would be to check if the message already
exists in the spool before replicating the data contents across.


Yes, and maybe also when receiving a new message on the replica while the master is down: if a file corresponding to the UID is already there, we could assume it is actually in use and that the message will come later from the concerned server (unless UIDs are handed out by a central service). Of course it would be harder to handle than that, but those are already a few pointers.

I can see the use-case.  Very interesting.  Mmm.  It should be fairly
easy to add a flag (possibly per partition/channel pair) for this.

In our case such a scenario would actually make sense, notably considering the number of mailboxes (tens of millions). Different people have different needs for their architecture, and the more possibilities offered to Cyrus users, the better. But not everything can be implemented, and we guess choices will have to be made.
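The shared-spool check mentioned above could be as simple as this sketch (hypothetical spool layout; the point is just to skip the data transfer when the file already exists on the shared partition):

```python
# Sketch of the shared-spool optimization: before pulling message data
# across, check whether the spool file already exists on the (shared)
# message partition and skip the transfer if so. Illustrative layout,
# not the real Cyrus spool naming or replication code.

import os
import tempfile

def replicate_message(spool_dir, uid, fetch_data):
    # fetch_data() pulls the body from the master; it is only called
    # when the file is not already present on the shared partition
    path = os.path.join(spool_dir, "%d." % uid)
    if os.path.exists(path):
        return False  # already on the shared store, nothing to copy
    with open(path, "wb") as f:
        f.write(fetch_data())
    return True

spool = tempfile.mkdtemp()
copied = replicate_message(spool, 1, lambda: b"message body")
skipped = replicate_message(spool, 1, lambda: b"message body")
```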

Comments and input are welcome :)

Let me know if you need anything clarified in what I wrote there - that's
kind of "off the cuff", but it's stuff I've been thinking about for a
long time.

Bron.


That's indeed really well thought out :) Thanks for your answers.
And as explained in the initial mail we would be happy to help you where possible, since we do have similar goals here.


Regards
Julien
