This might be long, I'm going to front-load a bunch of background information to provide context to my eventual questions.
I'm talking primarily about sync_client in rolling aka repeat mode, where it processes a channel's sync log ad infinitum. The one-shot invocations of sync_client are largely out of scope here. Also, hereafter when I say "sync_server", I mean both sync_server itself, and also the sync commands in imapd when the replication capability is enabled, since their underlying implementation is literally the same. In the current state of the world, sync_client has no idea what to do with a "NO IMAP_MAILBOX_LOCKED" response. This is generally not a problem: sync_server always blocks when obtaining mailbox locks, and so it never returns this status. In the pathological case where it blocks longer than sync_client is willing to wait (half an hour I believe), sync_client will treat the connection as dropped, abort its current run, wait for its sync_repeat_interval to elapse, and then process the entire aborted sync log again (and so on until it eventually succeeds, and then it gets on with the next log, and eventually catches up). As I understand it, replication targets are generally not expected to be doing much of anything, so the chances a destination mailbox is locked for long enough to be a problem right at the moment sync_client cares about it are pretty slim. Enter backupd. backupd functions pretty similarly to sync_server, except of course that its backing store is backup files, rather than mailboxes. Backup files are per-user, not per-mailbox, which means a broader locking granularity, which means more chance of contention. I also expect it's just more likely to encounter locked backups: * The compact process will probably take a while for large users, especially if run infrequently * An administrator (or tool) might spend a while browsing a user's backup to locate content that needs restoring * The restore process might take a while for large restorations * Someone debugging a backup index might get distracted while holding the lock... (sorry, self) At the moment, backupd has no choice but to block in the same way that sync_server does. But that means that, while it's waiting around for one user backup to become available, other user's backups are becoming stale. This might not be the end of the world, but it's hardly ideal. What I'd like to do is have backupd open backups in non-blocking mode, and immediately return a "NO IMAP_MAILBOX_LOCKED" response if it's unable to (this bit is trivial). At the moment, sync_client treats "NO IMAP_MAILBOX_LOCKED" like any other error: it aborts the entire run, waits until its sync_repeat_interval elapses, and then starts the aborted run again from the top. Any user it didn't get to before hitting the locked one will keep getting missed until the locked one succeeds. (And the logs get noisy with the errors.) (Actually, it promotes failed small-granularity syncs, like META, MAILBOX, etc up to USER, and then eventually aborts as described if the USER level also fails.) So what I mean is, sync_client is currently -somewhat- able to handle a "NO IMAP_MAILBOX_LOCKED" response, but not very gracefully. I would like to make it recognise and sensibly handle this response -- effectively skipping over the locked user(s), carrying on with the rest of the job, and trying the locked ones again later, when they're hopefully no longer locked. I can see a few ways of implementing this: A) Add a new sync_action_list for locked users. When a USER fails due to IMAP_MAILBOX_LOCKED, add them to the locked users list, and keep going as usual. (If any smaller granularity fails, it gets promoted to USER.) Re-process the locked users list (with suitable delay time/retry limit config) once everything else has been processed. Abort the run as usual if the retry limit is reached and there are still unprocessed users in the locked list. B) Same as A, but instead of aborting if the retry limit is hit, sync_log() the affected user and succeed. The locked user will be handed down from log to log until it eventually succeeds, and in the meantime everyone else gets backed up normally. C) Same as B, but only retry once (not in a loop) before they get sync_logged for later processing. D) Don't add the new sync_action_list. If any operation returns IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and let the next run deal with it. Option A is already implemented on my github, and seems to work ok: https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry I thought of using sync_log as a solution to the "it still might abort on long-held locks" problem, and then thinking through that, saw the other options. I'm not sure of the safety of calling sync_log() from within sync_client itself. It *looks* like it should be totally fine, but maybe I've missed something? If calling sync_log() from sync_client is not safe, then A is basically the solution, and I just need to polish it a little more and push it up. But if it is safe, then the other options start looking better: B is the most code, but allows us to handle locked backups within the same run if the lock turns out to not be held for long. C has a similar benefit, and similar amount of code, but won't wait around as long before deferring it to the next run. D is (I think) the smallest implementation, but any locked backup won't be processed again at all for at least sync_repeat_interval, even if it was only locked for a moment. I'm inclining towards D at the moment (less code=>less bugs), even though it means throwing away most of my existing implementation and starting again. But it depends on sync_log() being okay, which I'm not sure about. Thoughts? (None of this will have any impact on conventional sync_client => sync_server replication, since sync_server never returns a "NO IMAP_MAILBOX_LOCKED" response. It could open up possibilities in the future, but I'm not exploring those here.)