> D) Don't add the new sync_action_list. If any operation returns > IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and > let the next run deal with it. > > [...] > > I'm inclining towards D at the moment (less code=>less bugs), even > though it means throwing away most of my existing implementation and > starting again. But it depends on sync_log() being okay, which I'm not > sure about.
So I gave this option a shot, and it works okay in my testing. Implementation is here: https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry-take2 I like this implementation a lot more than the last, so I'll push it upstream in a couple of days if no-one else has an opinion either way. On Wed, Jun 22, 2016, at 12:25 PM, ellie timoney via Cyrus-devel wrote: > This might be long, I'm going to front-load a bunch of background > information to provide context to my eventual questions. > > I'm talking primarily about sync_client in rolling aka repeat mode, > where it processes a channel's sync log ad infinitum. The one-shot > invocations of sync_client are largely out of scope here. > > Also, hereafter when I say "sync_server", I mean both sync_server > itself, and also the sync commands in imapd when the replication > capability is enabled, since their underlying implementation is > literally the same. > > In the current state of the world, sync_client has no idea what to do > with a "NO IMAP_MAILBOX_LOCKED" response. This is generally not a > problem: sync_server always blocks when obtaining mailbox locks, and so > it never returns this status. In the pathological case where it blocks > longer than sync_client is willing to wait (half an hour I believe), > sync_client will treat the connection as dropped, abort its current run, > wait for its sync_repeat_interval to elapse, and then process the entire > aborted sync log again (and so on until it eventually succeeds, and then > it gets on with the next log, and eventually catches up). > > As I understand it, replication targets are generally not expected to be > doing much of anything, so the chances a destination mailbox is locked > for long enough to be a problem right at the moment sync_client cares > about it are pretty slim. > > Enter backupd. > > backupd functions pretty similarly to sync_server, except of course that > its backing store is backup files, rather than mailboxes. Backup files > are per-user, not per-mailbox, which means a broader locking > granularity, which means more chance of contention. I also expect it's > just more likely to encounter locked backups: > > * The compact process will probably take a while for large users, > especially if run infrequently > * An administrator (or tool) might spend a while browsing a user's > backup to locate content that needs restoring > * The restore process might take a while for large restorations > * Someone debugging a backup index might get distracted while holding > the lock... (sorry, self) > > At the moment, backupd has no choice but to block in the same way that > sync_server does. But that means that, while it's waiting around for > one user backup to become available, other user's backups are becoming > stale. This might not be the end of the world, but it's hardly ideal. > > What I'd like to do is have backupd open backups in non-blocking mode, > and immediately return a "NO IMAP_MAILBOX_LOCKED" response if it's > unable to (this bit is trivial). > > At the moment, sync_client treats "NO IMAP_MAILBOX_LOCKED" like any > other error: it aborts the entire run, waits until its > sync_repeat_interval elapses, and then starts the aborted run again from > the top. Any user it didn't get to before hitting the locked one will > keep getting missed until the locked one succeeds. (And the logs get > noisy with the errors.) > > (Actually, it promotes failed small-granularity syncs, like META, > MAILBOX, etc up to USER, and then eventually aborts as described if the > USER level also fails.) > > So what I mean is, sync_client is currently -somewhat- able to handle a > "NO IMAP_MAILBOX_LOCKED" response, but not very gracefully. > > I would like to make it recognise and sensibly handle this response -- > effectively skipping over the locked user(s), carrying on with the rest > of the job, and trying the locked ones again later, when they're > hopefully no longer locked. > > I can see a few ways of implementing this: > > A) Add a new sync_action_list for locked users. When a USER fails due > to IMAP_MAILBOX_LOCKED, add them to the locked users list, and keep > going as usual. (If any smaller granularity fails, it gets promoted to > USER.) Re-process the locked users list (with suitable delay time/retry > limit config) once everything else has been processed. Abort the run as > usual if the retry limit is reached and there are still unprocessed > users in the locked list. > > B) Same as A, but instead of aborting if the retry limit is hit, > sync_log() the affected user and succeed. The locked user will be > handed down from log to log until it eventually succeeds, and in the > meantime everyone else gets backed up normally. > > C) Same as B, but only retry once (not in a loop) before they get > sync_logged for later processing. > > D) Don't add the new sync_action_list. If any operation returns > IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and > let the next run deal with it. > > Option A is already implemented on my github, and seems to work ok: > https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry > > I thought of using sync_log as a solution to the "it still might abort > on long-held locks" problem, and then thinking through that, saw the > other options. > > I'm not sure of the safety of calling sync_log() from within sync_client > itself. It *looks* like it should be totally fine, but maybe I've > missed something? > > If calling sync_log() from sync_client is not safe, then A is basically > the solution, and I just need to polish it a little more and push it up. > > But if it is safe, then the other options start looking better: B is > the most code, but allows us to handle locked backups within the same > run if the lock turns out to not be held for long. C has a similar > benefit, and similar amount of code, but won't wait around as long > before deferring it to the next run. D is (I think) the smallest > implementation, but any locked backup won't be processed again at all > for at least sync_repeat_interval, even if it was only locked for a > moment. > > I'm inclining towards D at the moment (less code=>less bugs), even > though it means throwing away most of my existing implementation and > starting again. But it depends on sync_log() being okay, which I'm not > sure about. > > Thoughts? > > (None of this will have any impact on conventional sync_client => > sync_server replication, since sync_server never returns a "NO > IMAP_MAILBOX_LOCKED" response. It could open up possibilities in the > future, but I'm not exploring those here.)