Hi,

2008/11/8 Ludovic Courtès <[EMAIL PROTECTED]>:
> Hello!
>
> "Linas Vepstas" <[EMAIL PROTECTED]> writes:
>
>> I've got a little deadlock problem w/ guile. Here's the pseudocode:
I've got a much, much simpler case, see below.

> Can you try to provide actual code to reproduce the problem? :-)
> That would be great.

Sure, but you won't enjoy debugging it. Or even building it. Go to

   https://code.launchpad.net/~opencog-dev

check out the branch called "staging", and build it. Then cd to the
directory opencog/scm and run load.sh.

> Did you compile Guile with thread support

Yes. You will observe that the stack traces I sent were deadlocked in
garbage collection. Without thread support, how could things possibly
deadlock?

Anyway, I have an even simpler variant, with only *one* thread
deadlocked in gc. Here's the scenario:

thread A: scm_init_guile(); does some other stuff, then goes to sleep
in select(), waiting on socket input (as expected).

thread B: scm_init_guile() -- hangs here.

B deadlocks with the stack trace below:

#0  0xffffe425 in __kernel_vsyscall ()
#1  0xf7e60589 in __lll_lock_wait () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xf7e5bba6 in _L_lock_95 () from /lib/tls/i686/cmov/libpthread.so.0
#3  0xf7e5b58a in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
#4  0xf7844464 in scm_i_thread_put_to_sleep () at threads.c:1615
#5  0xf77eeca9 in scm_i_gc (what=0xf786422e "cells") at gc.c:552
#6  0xf77eefed in scm_gc_for_newcell (freelist=0xf787984c, free_cells=0x99fa25c) at gc.c:509
#7  0xf7843bff in guilify_self_2 (parent=0xf76b0e70) at ../libguile/inline.h:115
#8  0xf7845a9b in scm_i_init_thread_for_guile (base=0xf3b8a000, parent=0xf76b0e70) at threads.c:578
#9  0xf7845d82 in scm_init_guile () at threads.c:682
#10 0xf796f928 in opencog::SchemeEval::thread_init (this=0x995bc38) at ...

I built guile and added debug prints: one to scm_enter_guile(), which
takes a lock, and one to scm_leave_guile(), which drops a lock. I also
put prints into scm_i_thread_put_to_sleep(). The behaviour is very
clear, and very simple: when scm_init_guile() is called in thread A,
the result is that it is in "guile mode", i.e.
holding the lock -- it is created holding the lock. There's a series of
pairs of calls to leave..enter, which are always paired up. Anyway, when
thread A finally goes to sleep waiting on input, it does so with its
lock held. Read libguile/threads.c:scm_enter_guile() to see what I mean:

  static void
  scm_enter_guile (scm_t_guile_ticket ticket)
  {
    scm_i_thread *t = (scm_i_thread *)ticket;
    if (t)
      {
        scm_i_pthread_mutex_lock (&t->heap_mutex);
        resume (t);
      }
  }

while scm_leave_guile() does symmetrically the opposite. Anyway, at
this point, thread A is sleeping, holding the above lock, because the
last guile thing it did was to call scm_enter_guile(). Then *later on*,
thread B calls scm_init_guile(), and hangs, very clearly in
scm_i_thread_put_to_sleep(). The printfs reveal that the hang happens
here:

  /* Signal all threads to go to sleep */
  scm_i_thread_go_to_sleep = 1;
  for (t = all_threads; t; t = t->next_thread)
    {
      scm_i_pthread_mutex_lock (&t->heap_mutex);
    }

The for loop gets stuck, waiting for the lock. But the lock will never
be granted, because thread A is holding it, and is asleep indefinitely.
As a result, thread B is blocked forever: a deadlock.

I'm somewhat stumped, because I can't imagine how this code could
*ever* have worked in the first place. The deadlock seems really
blatant to me. It seems criminal for guile to *ever* return to the
caller while still holding a lock of any sort. But very clearly,
scm_init_guile() (and I guess most other calls) returns to C code
with a lock held. This is just begging for deadlocks!

--linas
