On Fri, May 24, 2013, at 01:34 AM, Karl Pielorz wrote:
> We're running cyrus-imapd-2.4.17 on FreeBSD. I've been looking at the
> replication built into Cyrus, but can't find much (if any)
> documentation on it.
>
> e.g. The shipped 'install-replication.html' file ends at:
>
> " ... You can also run cyr_synclog(8) instead, which will insert the
> record into the rolling replication log.
>
> Failover "
>
> And that's it. Is there anywhere I can find more info on replication /
> failover / setup, a 'howto' - or anything?

I'm afraid there isn't much :( Feel free to ask questions about specific things you run into, and we can use that as a basis to put together more detailed documentation.

Failover is kind of messy at the moment, because how you want to manage your failover is so site-dependent. Our process at FastMail looks like this:

1) update database to mark the server as "moving" so new connections get paused at nginx/web server level, and then wait 2 seconds for the config to be updated.
2) send a signal to the 'master' process to shut the server down.
3) wait for up to 10 seconds while grepping the process list every second for ongoing processes related to the same instance (we use -C $imapd_conf because we run many instances of Cyrus per server)
4) if the processes aren't dead after 10 seconds, kill them individually.
5) if THAT fails, kill -9. (yeah, I know - evil!)
6) check the $confdir/sync directory for log files, and run them with sync_client to ensure all replication is up to date. If anything before this failed, we bring this master back up and report that the failover didn't succeed.
7) shut down the replica
8) restart this side with the replica configuration
9) restart the other side with the master configuration

At the moment, we still move a master IP address to the instance which is running as master, meaning clients can reconnect to the same IP address after the failover. This is on its way out - we're now at the point where almost everything can read configuration from our "fmstatus.json" file, which is updated every second on every host, so they know where the master is actually located.

Obviously, a ton of this is really site-specific to us. Soon (yes, soon!)
we will be shifting to a full multi-master setup, where failover is as simple as pointing clients to the other end of the replication pair, and killing off existing connections so they reconnect (with some sync_log checking and force-running), which should shave quite a few seconds off the sync time and mean that long-running squatter jobs and other things don't get nuked off at the same time.

But yeah, it's not quite a turn-key system :(

Bron.

-- 
Bron Gondwana
br...@fastmail.fm
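
For anyone scripting something similar, steps 2 to 5 of the procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions, not FastMail's actual tooling: the function names (`list_instance_pids`, `stop_instance`) are hypothetical, SIGTERM is assumed as the shutdown signal (the message does not say which signal is used), the 2-second waits after steps 4 and 5 are invented for the sketch, and the `ps` flags may need adjusting on your platform.

```python
import os
import signal
import subprocess
import time


def list_instance_pids(imapd_conf):
    """Return PIDs whose command line contains '-C <imapd_conf>'."""
    # Assumption: every process of the instance was started with -C <imapd_conf>.
    out = subprocess.run(
        ["ps", "-ax", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = []
    for line in out.splitlines():
        pid, _, args = line.strip().partition(" ")
        tokens = args.split()
        # Require "-C" immediately followed by the exact config path.
        if pid.isdigit() and any(
            a == "-C" and b == imapd_conf for a, b in zip(tokens, tokens[1:])
        ):
            pids.append(int(pid))
    return pids


def stop_instance(master_pid, list_pids, send_signal=os.kill,
                  grace_seconds=10, sleep=time.sleep):
    """Escalating shutdown; returns True once no instance processes remain."""

    def gone_within(seconds):
        # Poll the process list once per second, as in step 3.
        for _ in range(seconds):
            if not list_pids():
                return True
            sleep(1)
        return not list_pids()

    def signal_all(signum):
        for pid in list_pids():
            try:
                send_signal(pid, signum)
            except ProcessLookupError:
                pass  # exited between listing and signalling

    send_signal(master_pid, signal.SIGTERM)  # step 2 (signal choice assumed)
    if gone_within(grace_seconds):           # step 3
        return True
    signal_all(signal.SIGTERM)               # step 4: kill stragglers individually
    if gone_within(2):                       # wait length assumed
        return True
    signal_all(signal.SIGKILL)               # step 5: kill -9
    return gone_within(2)
```

A caller would pass something like `lambda: list_instance_pids("/path/to/imapd.conf")` as `list_pids` (the path is a placeholder), and only go on to the sync log check in step 6 if `stop_instance` returns True.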