Hi,

As I've mentioned on the mailing list, we have had to put quite a lot of infrastructure around Cyrus to make replication robust in all cases.
While the core replication protocol seems pretty stable now, and with the GUID stuff it will be easier to do integrity checks, it's still very much not a turn-key solution. Every site will have to put a lot of effort into understanding how everything works and building their own systems for keeping things running.

... while there is, in theory, some commercial advantage in us knowing how to do stuff and not telling anyone else, we think that's outweighed by having replication so good that lots of people are using it and helping improve Cyrus ...

So I'd like to start a dialogue on the topic of making Cyrus replication robust across failures, with the following goals:

a) MUST never lose a message that's been accepted for delivery, except in the case of total drive failure.
b) MUST have a standard way to integrity check and repair a replica-pair after a system crash.
c) MUST have a clean process to "soft-failover" to the replica machine, making sure that all replication events from the ex-master have been synchronised.
d) MUST have replication start/restart automatically when the replica is available, rather than requiring it to be online at master start time.
e) SHOULD be able to copy back messages which only exist on the replica due to a hard-failover, handling UIDs gracefully (more on this later). Alternatively, at least MUST (to satisfy point 'a') notify the administrator that the message has different GUIDs on the two copies and that something will need to be done about it (to satisfy point 'd' this must be done without bailing out replication for the remaining messages in the folder).
f) SHOULD keep replicating in the face of an error which affects a single mailbox, keeping track of that mailbox so that a sysadmin can fix the issue and then replicate that mailbox by hand.
g) MAY have a method to replicate to two different replicas concurrently (replay the same sync_log messages twice), allowing one replica to be taken out of service and a new one created while having no "gaps" in which there is no second copy alive (we use rsync, rsync again, stop replication, rsync a third time, start replication to the new site - but it's messy and gappy).

Does that sound like a reasonable set of goals to everyone? Have I missed anything important? I'd like to see a situation where everyone who wants to run replication can turn it on and pretty much forget about it, trusting that it will just_work[tm] (though reading the log files just in case!)

COPYING BACK and UIDs:
======================

The easy approach:

* give both messages a new UID after uid_last and delete the old UIDs on both machines

The tricky approach:

* track the highest UID that has ever been presented to an IMAP client. UIDs above that point are "soft" UIDs, and you don't need to worry about changing them, so select whichever end has actually had the UID seen and keep the message there with the same UID, injecting only the other message with a new UID.
* if both ends have been viewed via IMAP then your client is probably already confused. Fall back to nuking that UID and injecting both messages again with higher UIDs.

Regards,

Bron.
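
PS: for anyone who prefers code to prose, here is a rough Python sketch of the two UID approaches above. It is illustrative only and is not how Cyrus does anything today. All the names (Resolution, easy_approach, tricky_approach, master_seen, replica_seen) are made up for the example, uid_last is assumed to be the larger of the two ends' values, and keeping the master's copy when neither end has shown the UID to a client is an assumption, since the description above leaves that case open.

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    """Final UIDs for two different messages that collided on one UID."""
    master_msg_uid: int   # UID for the message that lived on the master
    replica_msg_uid: int  # UID for the message that lived on the replica
    uid_last: int         # new uid_last for the folder on both ends


def easy_approach(uid_last: int) -> Resolution:
    # Both messages get fresh UIDs above uid_last; the old UID is
    # deleted on both machines.
    return Resolution(uid_last + 1, uid_last + 2, uid_last + 2)


def tricky_approach(uid: int, uid_last: int,
                    master_seen: int, replica_seen: int) -> Resolution:
    # master_seen / replica_seen: highest UID ever presented to an IMAP
    # client on each end (hypothetical counters, assumed to be tracked).
    master_exposed = uid <= master_seen
    replica_exposed = uid <= replica_seen

    if master_exposed and replica_exposed:
        # Clients on both ends have seen this UID: nuke it and
        # inject both messages again with higher UIDs.
        return easy_approach(uid_last)
    if replica_exposed:
        # Only the replica's copy was seen: it keeps the UID.
        return Resolution(uid_last + 1, uid, uid_last + 1)
    # Only the master's copy was seen, or the UID is still "soft" on
    # both ends (assumption: prefer the master's copy in that case).
    return Resolution(uid, uid_last + 1, uid_last + 1)
```

For example, with a conflict on UID 10, uid_last of 12, and only the master having shown UID 10 to a client, tricky_approach(10, 12, master_seen=10, replica_seen=8) keeps the master's message at UID 10 and injects the replica's message as UID 13.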