On 12/08/2011 13:37, Bron Gondwana wrote:
On Fri, 12 Aug 2011 11:16:30 +0200, Julien Coloos
<julien.col...@atos.net> wrote:
1) Master-master replication
There are many difficulties to overcome for this feature, as well as
many ways to achieve it:
- 2-servers only, multi-master, ... ?
N servers - mesh, chain, whatever. Ideally any layout will work.
My plan (which is partially done) is named channels for sync - basically
the name of the server you're connecting to. When you make a change,
it injects a record into the sync log for each channel you have
configured.
The sync_client for that channel then runs a reconciliation of every
listed item.
Nice, no wonder I thought that this channel stuff I saw in the source
code could be used to separate logs for different servers, since that
was your intent :)
And proceeding that way allows creating mesh/chain topologies depending
on the configuration.
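Just to check my understanding, I picture the channel mechanism working
roughly like this; a minimal Python sketch with made-up names and paths,
not the actual Cyrus internals:

    import os

    SYNC_LOG_DIR = "/tmp/sync-logs"          # illustrative location only
    CHANNELS = ["server-b", "server-c"]      # mesh; a chain would list one next hop

    def sync_log(channel, record):
        """Append one record to the named channel's log; that channel's
        sync_client later replays it and reconciles every listed item."""
        os.makedirs(os.path.join(SYNC_LOG_DIR, channel), exist_ok=True)
        with open(os.path.join(SYNC_LOG_DIR, channel, "log"), "a") as f:
            f.write(record + "\n")

    def log_mailbox_change(mailbox):
        """Called whenever a mailbox changes locally: one record per channel."""
        for channel in CHANNELS:
            sync_log(channel, "MAILBOX " + mailbox)

    log_mailbox_change("user.julien")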
If sync_server makes a change, it should suppress the sync_log for the
channel on which the change came in - so it doesn't try to replicate
back. But it doesn't hurt too badly, because the data will be in sync
anyway, in theory...
Yeah, thinking about it, it's not easy to prevent unnecessary
reconciliations when topologies get complex. Fortunately propagation
between servers quickly ends once they see they are in sync. And people
who do care could use a chain topology to limit this.
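In other words, something like this (again just an illustrative sketch, not
real code from the tree): nothing is ever logged back toward the channel a
change came in on, and propagation stops as soon as a reconciliation finds
nothing to change.

    def reconcile(local_state, remote_state):
        """Return what is missing locally; an empty list means we are in sync."""
        return [item for item in remote_state if item not in local_state]

    def apply_remote_changes(local_state, remote_state, origin, channels, logs):
        changes = reconcile(local_state, remote_state)
        if not changes:
            return                        # already in sync: nothing to forward
        local_state.extend(changes)
        for channel in channels:
            if channel != origin:         # never echo back to the sender
                logs.setdefault(channel, []).append(changes)

    logs = {}
    apply_remote_changes(["uid1"], ["uid1", "uid2"], "server-a",
                         ["server-a", "server-c"], logs)
    print(logs)                           # only server-c gets a new record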
- optimistic replication ?
There are some things that I want to add to this, in particular a
"replication cache" which remembers the state of the mailbox at
last sync, allowing a sync_client to optimistically generate a
"change request" to the sync_server, listing all the changes that
would bring the target mailbox up-to-date, and also listing the
current state it EXPECTS on the server. That way we can check
up-front if anything on the replica side has already changed, and
fall back to a full replication run.
The "full replication run" would first reconcile everything past the
last known "in sync" modseq, and only if that failed would it
send the entire mailbox contents.
Would that cache be per-server, or shared on the platform (kinda like
the MUPDATE server) ?
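Either way, here is how I read the optimistic path; a rough sketch with
invented names, assuming the cache simply remembers the per-item state seen
at the last successful sync:

    def build_change_request(cache, current):
        """cache = state recorded at last sync, current = local state now."""
        changes = {k: v for k, v in current.items() if cache.get(k) != v}
        return {"expected": cache, "changes": changes}

    def apply_change_request(replica_state, request):
        """Replica side: apply optimistically, or ask for a fuller run."""
        if replica_state != request["expected"]:
            return "FALLBACK"             # something changed under us: reconcile
                                          # from the last known in-sync MODSEQ
        replica_state.update(request["changes"])
        return "OK"

    cache   = {"uid1": "\\Seen"}
    current = {"uid1": "\\Seen \\Flagged", "uid2": ""}
    print(apply_change_request({"uid1": "\\Seen"},
                               build_change_request(cache, current)))   # OK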
- since most meta-data do not (yet ?) have some kind of revision
control, how to reconcile when concurrent changes happen ?
Add revision control to everything. In particular:
a) modseq to all annotations
b) keep DELETED entries in mailboxes.db with the UIDVALIDITY of
the deleted mailbox (also on moves)
c) version numbers and "deleted markers" for all sieve scripts
Fair enough.
b) is indeed useful to determine whether the mailbox was created on the
replica while the master was down (and needs to be copied back), or was
deleted on the master and the deletion was not yet pushed to the replica.
A little bit curious about those "deleted markers" :) What do you have
in mind ?
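For b), I picture the reconciliation decision looking roughly like this
(purely illustrative structures, not the real mailboxes.db format):

    def reconcile_mailbox(name, master_db, replica_uidvalidity):
        """Mailbox exists on the replica but is not live on the master."""
        tombstone = master_db.get(("DELETED", name))
        if tombstone and tombstone["uidvalidity"] == replica_uidvalidity:
            return "delete on replica"    # same incarnation was removed on the master
        return "copy back to master"      # new incarnation created while the master was down

    master_db = {("DELETED", "user.bob"): {"uidvalidity": 1313141790}}
    print(reconcile_mailbox("user.bob", master_db, 1313141790))   # delete on replica
    print(reconcile_mailbox("user.bob", master_db, 1313150000))   # copy back to master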
-> actually, should data be allowed to diverge naturally ? that
is, are users allowed to hit any master for the mailbox, or should
they be directed to *the* reference master of the mailbox (other
masters being hit only when the reference is down)
- ...
They should be directed to the primary copy - IMAP and eventual
consistency are not friends. You can make it work with UID promotion
logic, but flag merging will always be an inexact science unless we
keep MODSEQ and LASTMODIFIED separately for every single flag, and
the UID promotion still sucks when it happens to you.
I also think that this would be easier to handle than trying to manage
concurrent modifications on all replicas as the nominal case (thinking
about it already gives me headaches).
I guess keeping MODSEQ and LASTMODIFIED for every flag would be a little
overkill just to achieve better merging.
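Just to make that trade-off concrete, a tiny worked example (illustrative
only) of what gets lost when a single MODSEQ covers the whole flag set:

    def merge_flags(a, b):
        """a and b are (modseq, flag_set); the higher modseq wins the whole set."""
        return a if a[0] >= b[0] else b

    master  = (42, {"\\Seen", "\\Flagged"})     # user flagged it via the master
    replica = (43, {"\\Seen", "\\Answered"})    # and answered it via the replica
    print(merge_flags(master, replica))         # \Flagged silently disappears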
Talking about UID promotion (great idea by the way :)), we wonder if
there could be some gain in using a centralised service (à la MUPDATE; hence
my previous question about the "replication cache") which would be used
to store and hand out the next UID for each mailbox. This could prevent UID
collisions, replacing the need for UID promotion in general, and should have
less impact on IMAP clients with aggressive caching.
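Roughly what we have in mind, as an entirely hypothetical sketch; nothing
like this exists in Cyrus today, and the names are invented:

    import threading

    class UidService:
        """Toy stand-in for a MUPDATE-like central allocator."""
        def __init__(self):
            self._next = {}
            self._lock = threading.Lock()

        def allocate(self, mailbox):
            with self._lock:
                uid = self._next.get(mailbox, 1)
                self._next[mailbox] = uid + 1
                return uid

    svc = UidService()
    print(svc.allocate("user.julien"))    # 1
    print(svc.allocate("user.julien"))    # 2 - no collision, whichever server asks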
2) Dedicated (list of) replication server(s) per mailbox
For example there is the default server-wide replication
configuration, but a new per-mailbox annotation could override that
global setting.
One of the advantages of this feature is that it allows spreading the load
over more than one server (currently only the replica) when a master is
down.
I want to do this in mailboxes.db. It's a lot of work, but extending the
murder syntax to allow multiple "locations" to be specified for a mailbox
including listing the preferred primary would allow all this to fit in
naturally with murder. And it would ROCK for XFER, because you would just
add the mailbox to a new server, replicate it, set the other end as the
preferred master, and then once all clients had finished their business
and been moved over, you could clean up the source server by removing the
mailbox from its list.
This is kind of the holy-grail of replication.
Eh, I hadn't thought about murder. Then yes, it would be more logical to
have it in mailboxes.db.
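The XFER flow you describe would then boil down to something like this; a
sketch over a hypothetical multi-location mailboxes.db entry, not the real
format:

    def replicate(entry, to):
        pass                                   # stand-in for a real sync run

    def xfer(entry, new_server):
        old_primary = entry["primary"]
        entry["locations"].append(new_server)  # 1. add the mailbox on the target
        replicate(entry, to=new_server)        # 2. bring it up to date
        entry["primary"] = new_server          # 3. flip the preferred primary
        # 4. once clients have drained off the old server, drop the old location
        entry["locations"].remove(old_primary)

    entry = {"locations": ["imap1"], "primary": "imap1"}
    xfer(entry, "imap2")
    print(entry)   # {'locations': ['imap2'], 'primary': 'imap2'}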
3) Handling message partitions that actually do not need replication
For example:
- meta-data files are stored on local partitions, for faster
access, and do need to be replicated
- messages are stored on partitions shared with other masters and
do not need to be replicated; think NFS access (or similar), with
data already secured (e.g. RAID or proper filesystem replication)
That's a peculiar scenario, and we know that feature may be a little
tricky (even more so if trying to manage it in a generic way, taking
into account each kind of data/meta-data). But maybe it would benefit
people other than us ?
Goodness me. I hadn't even thought of this. Yes, it could be done.
Actually the easiest way would be to check if the message already
exists in the spool before replicating the data contents across.
Yes, and maybe also when receiving a new message on the replica while
the master is down: if a file corresponding to the UID is already there,
we could assume it is actually already used and will come later from the
concerned server (unless UIDs are given by a central service). Well, it
would certainly be harder to handle than that, but those are already a few
pointers.
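For the spool check you describe, I picture something along these lines (a
sketch only, with illustrative paths and a made-up helper):

    import os

    def replicate_message(mailbox_dir, uid, send_body):
        """Skip shipping the body if the spool file is already visible,
        e.g. on a shared NFS partition."""
        spool_file = os.path.join(mailbox_dir, "%d." % uid)
        if os.path.exists(spool_file):
            return "metadata only"        # body already on the shared partition
        send_body(spool_file)             # normal case: ship the data across
        return "full"

    print(replicate_message("/tmp", 1, lambda path: None))   # "full" unless /tmp/1. exists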
I can see the use-case. Very interesting. Mmm. It should be fairly
easy to add a flag (possibly per partition/channel pair) for this.
In our case such a scenario would actually make sense, notably
considering the number of mailboxes (tens of millions).
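And if such a per partition/channel flag existed, I imagine the replication
code consulting something like this (entirely hypothetical names, just to
show the shape of it):

    SHARED_DATA = {("shared-nfs", "server-b"): True}   # bodies already visible on that pair

    def needs_body_transfer(partition, channel):
        return not SHARED_DATA.get((partition, channel), False)

    print(needs_body_transfer("shared-nfs", "server-b"))   # False: metadata only
    print(needs_body_transfer("local-ssd", "server-b"))    # True: ship the body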
Different people have different needs for their architecture, and the
more possibilities offered to Cyrus users, the better. But not
everything can be implemented, and we guess choices will have to be
made.
Comments and inputs are welcomed :)
Let me know if you need anything clarified in what I wrote there - that's
kind of "off the cuff", but it's stuff I've been thinking about for a
long time.
Bron.
That's indeed really well thought out :) Thanks for your answers.
And as explained in the initial mail we would be happy to help you where
possible, since we do have similar goals here.
Regards
Julien