2008/11/19 Neil Jerram <[EMAIL PROTECTED]>:
> 2008/11/17 Linas Vepstas <[EMAIL PROTECTED]>:
>> I've been seeing all sorts of deadlocks in guile, and so I wrote a small
>> debugging utility to try to track down the problems.
>
> Interesting patch!
>
> One query; I may be being a bit dumb, I'm only just recovering from a
> bad cold, but anyway... Your patch checks for a thread unlocking
> mutexes in the reverse order that it locked them in (let's call this
> point "A"). But I thought your recent investigations had shown that
> the problem was threads doing locking in inconsistent order, e.g.
> thread 1 locks M1 and then M2, while thread 2 locks M2 and then M1
> (point "B"). Are points "A" and "B" equivalent? (It isn't obvious to
> me if so.)
Hi Neil,

There is (should be) only one lock in guile that is "inconsistent" in
its locking order, and that is the t->heap_mutex lock. My guess is that
valgrind is tripping over this one. I suppose I should argue that this
is why one needs a custom patch, instead of using valgrind (which is
otherwise pretty fantastic for memory corruption and the like).

The t->heap_mutex lock is held whenever a thread is "guilified" or is
"in guile mode". Its primary reason for being is to keep the garbage
collector from running until all threads have been halted. (This is
done by scm_i_thread_put_to_sleep.)

After applying my set of patches, the only inconsistent (and
widespread!) lock-ordering problem that I'm seeing stems from the
asymmetric way in which scm_i_scm_pthread_mutex_lock is used to take a
lock and then drop it. If you follow the #define for
scm_i_scm_pthread_mutex_lock, you find that it's of the form:

   drop (thread->heap_mutex)
   take (some lock)
   take (thread->heap_mutex)

whereas the unlock is just:

   drop (some lock)

You can see this, for example, in ports.c line 728:

   scm_i_scm_pthread_mutex_lock (&scm_i_port_table_mutex);
   scm_i_remove_port (port);
   scm_i_pthread_mutex_unlock (&scm_i_port_table_mutex);

To be "correctly" nested, the unlock should have dropped the heap mutex
first, and then reacquired it. I believe that doing this would be
enough to quiet down helgrind (at least for most cases ... what remains
would be interesting).

OK, the above was just facts; below, some random comments which might
be incorrect (reasoning about locks can be deceptive; I've certainly
mis-reasoned several times w.r.t. guile):

-- I had decided that the way that the dropping of the lock is done is
OK, and that it would be silly (and a performance hit) to try to "fix"
the unlocking order. For debugging with helgrind, you may want to do
this anyway, but for production, it seemed unnecessary. Prod me, and
perhaps I can reconstruct why I came to this conclusion.
-- The reason for dropping the heap_mutex before grabbing the other
lock (for example scm_i_port_table_mutex) is somewhat obscure, but at
one point I decided that this was OK, and arguably correct. As I write
this, I've forgotten why. However, this should be a focus of attention,
and re-thought-out. If you are willing to think about it, prod me and
maybe I can say something intelligent. Changing this would also quiet
helgrind (and boost performance). It might be safe; I need to rethink
this.

Anyway, my patch allows for this single occurrence of the
out-of-sequence heap_mutex unlock.

--linas
