2008/11/19 Neil Jerram <[EMAIL PROTECTED]>:
> 2008/11/17 Linas Vepstas <[EMAIL PROTECTED]>:
>> I've been seeing all sorts of deadlocks in guile, and so I wrote a small
>> debugging utility to try to track down the problems.
>
> Interesting patch!
>
> One query; I may be being a bit dumb, I'm only just recovering from a
> bad cold, but anyway... Your patch checks for a thread unlocking
> mutexes in the reverse order that it locked them in (let's call this
> point "A"). But I thought your recent investigations had shown that
> the problem was threads doing locking in inconsistent order, e.g.
> thread 1 locks M1 and then M2, while thread 2 locks M2 and then M1
> (point "B"). Are points "A" and "B" equivalent? (It isn't obvious to
> me if so.)
Hi Neil,

There is (should be) only one lock in guile that is "inconsistent" in
its locking order, and that is the t->heap_mutex lock. My guess is that
valgrind is tripping over this one. I suppose I should argue that this
is why one needs a custom patch, instead of using valgrind (which is
otherwise pretty fantastic for memory corruption and the like).

The t->heap_mutex lock is held whenever a thread is "guilified" or is
"in guile mode". Its primary reason for being is to keep the garbage
collector from running until all threads have been halted. (This is
done by scm_i_thread_put_to_sleep.)

After applying my set of patches, the only inconsistent (and
widespread!) lock-ordering problem that I'm seeing stems from the
asymmetric way in which scm_i_scm_pthread_mutex_lock is used to take a
lock and then drop it. If you follow the #define for
scm_i_scm_pthread_mutex_lock, you find that it's of the form:

   drop (thread->heap_mutex)
   take (some lock)
   take (thread->heap_mutex)

whereas the unlock is just:

   drop (some lock)

You can see this, for example, in ports.c line 728:

   scm_i_scm_pthread_mutex_lock (&scm_i_port_table_mutex);
   scm_i_remove_port (port);
   scm_i_pthread_mutex_unlock (&scm_i_port_table_mutex);

To be "correctly" nested, the unlock should have dropped the heap mutex
first, and then reacquired it. I believe that doing this would be
enough to quiet down helgrind (at least for most cases ... what remains
would be interesting).

OK, the above was just facts; below, some random comments which might
be incorrect (reasoning about locks can be deceptive; I've certainly
mis-reasoned several times w.r.t. guile):

-- I had decided that the way that the dropping of the lock is done is
OK, and that it would be silly (and a performance hit) to try to "fix"
the unlocking order. For debugging with helgrind, you may want to do
this anyway, but for production, it seemed unnecessary. Prod me, and
perhaps I can reconstruct why I came to this conclusion.
-- The reason for dropping the heap_mutex before grabbing the other
lock (for example scm_i_port_table_mutex) is somewhat obscure, but at
one point I decided that this was OK, and arguably correct. As I write
this, I've forgotten why. However, this should be a focus of attention,
and re-thought-out. If you are willing to think about it, prod me and
maybe I can say something intelligent. Changing this would also quiet
helgrind (and boost performance). It might be safe; I need to rethink
this.

Anyway, my patch allows for this single occurrence of the
out-of-sequence heap_mutex unlock.

--linas
