Hi Bron, On Sunday, 19. July 2015 10:05:12 you wrote: > On Sun, Jul 19, 2015, at 07:06 AM, Thomas Jarosch wrote: > > mailbox_release_resources() closes the file descriptor > > to the index. This implicitly releases any lock on the file. > > > > Unfortunately the function does not reset the index_locktype state. > > Therefore we didn't notice that a mailbox rename > > operated on an unlocked index of the new mailbox. Yikes. > > How did you discover this? I'm surprised we haven't run into this > already.
currently cyrus-imapd 2.4.18 deadlocks about once a week. Similar symptoms as described here: http://lists.andrew.cmu.edu/pipermail/info-cyrus/2013-May/036951.html So I'm now going through the locking code. Of course I can't reproduce it on my development machine and I tried really hard. The only pattern I can see is that it happens one machines with very large mailboxes. They run "quota -f" nightly, ipurge to delete trashed mails older than 30 days and the cyr_getdeleted tool I posted recently. -> the more cyrus tools we run in parallel, the worse the problem seems. > > Note: mailbox_index_lock() implies a mailbox_read_index_header(). > > But it also does more work than just read the index header, > > so there might be side effects. > > I changed to mailbox_index_lock_internal() since we don't want to abort > on a DELETED mailbox. ok > > @@ -4370,7 +4371,7 @@ static int mailbox_reconstruct_create(const char > > *name, struct mailbox **mbptr)> > > /* Attempt to open index */ > > r = mailbox_open_index(mailbox); > > > > - if (!r) r = mailbox_read_index_header(mailbox); > > + if (!r) r = mailbox_lock_index(mailbox, LOCK_EXCLUSIVE); > > > > if (r) { > > > > printf("%s: failed to read index header\n", mailbox->name); > > syslog(LOG_ERR, "failed to read index header for %s", mailbox- >name); > > I'm not convinced by this one - the whole point with reconstruct is that > you're working with a potentially broken mailbox, and you want to check > each step more simply to see if it works. I've left this one in place. if you back out that change, it won't work anymore: mailbox_read_index_header() will bail out because the index is not locked. I've tested that code path by faking the return code of mailbox_open() so it executes the reconstruct_create() code path. Right now I've interfaced the locking code with helgrind's lock validation mechanism and annotated all lock_xx() calls. That works more or less. Since kernel 3.14.x, there's a userspace version of "lockdep" called liblockdep. I'll switch to that soon as it supports "tryLock()" style calls as we do with lock_nonblocking(). The tricky part is that cyrus is a multi-process tool instead of a multi- threaded one, so the current lock validation tools are more or less useless. I came up with a workaround: Dump the lock state information in lock_xxx() to a file, probably JSON. A "daemon" tool will later on read that "lock transation file" and feed the information in the lock validation tool / library. Hopefully this works out, I have a feeling we have a classic ABBA deadlock somewhere in the code. Cheers, Thomas