On Wed, Mar 14, 2012 at 11:02 AM, Wendy Cheng <[email protected]> wrote:
> This mutex seems to only exist with Storage Engine branch.
>
> We're working on prototyping a new storage engine on top of an NVRAM
> device. This morning, the test server hangs. Four worker threads are
> blocked waiting for cache_lock. The cache_lock holder is blocked in
> "notify_io_complete()" waiting for the thread mutex (thru LOCK_THREAD
> macro) - it is our add-on completion handler (thread) that does not
> get dispatched via the existing memcached's event_handler(). Examining
> the source code by eyes, I don't seem to be able to find the place
> where the LOCK_THREAD could be invoked elsewhere.
>
> I'm still checking the code. At the same time, not sure whether folks
> can pass the following info to speed up this debugging:
>
> 1) what is a "tap_thread" ?
> 2) what structures are protected by this mutex ?
>
OK, found the place where it deadlocked. Our completion thread tried to
complete the I/O for the memcached worker thread (by calling
notify_io_complete()) while holding cache_lock. The worker thread had
locked itself (thread->mutex) before entering process_command(), and then
tried to obtain cache_lock. So the two threads took the same two locks in
opposite order, which implies I can't call notify_io_complete() while
holding cache_lock.

I guess I need to spend some time getting familiar with the network
protocol logic, instead of staying isolated within the storage engine
itself.

It would be nice to know what a tap thread is, but I'm OK for now.

Thanks,
Wendy
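
The lock ordering described above can be sketched with plain pthreads. This is a minimal illustration, not memcached code: cache_lock and thread_mutex are stand-in mutexes, and notify_stub(), worker(), completer() and run_demo() are hypothetical names invented for the example. The worker takes the thread mutex and then cache_lock, as in the report. The completion thread updates its state under cache_lock, releases it, and only then notifies, so it never holds both locks at once.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ROUNDS 1000

/* Stand-ins for the two locks in the report (not the real memcached objects). */
static pthread_mutex_t cache_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;

static int completed = 0; /* protected by cache_lock   */
static int notified  = 0; /* protected by thread_mutex */

/* Hypothetical stand-in for notify_io_complete(): it needs the thread mutex. */
static void notify_stub(void)
{
    pthread_mutex_lock(&thread_mutex);
    notified++;
    pthread_mutex_unlock(&thread_mutex);
}

/* Worker: thread mutex first, then cache_lock (the order seen in the hang). */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&thread_mutex);
        pthread_mutex_lock(&cache_lock);
        /* ... command processing that touches the cache ... */
        pthread_mutex_unlock(&cache_lock);
        pthread_mutex_unlock(&thread_mutex);
    }
    return NULL;
}

/* Completion thread: finish the cache update, drop cache_lock, then notify. */
static void *completer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&cache_lock);
        completed++;
        pthread_mutex_unlock(&cache_lock);
        /* Calling notify_stub() before the unlock above would acquire the
         * locks in the opposite order to worker() and could deadlock. */
        notify_stub();
    }
    return NULL;
}

/* Runs both threads to completion and returns the number of notifications. */
int run_demo(void)
{
    pthread_t w, c;
    int rc;

    completed = 0;
    notified = 0;
    rc = pthread_create(&w, NULL, worker, NULL);
    assert(rc == 0);
    rc = pthread_create(&c, NULL, completer, NULL);
    assert(rc == 0);
    pthread_join(w, NULL);
    pthread_join(c, NULL);
    return notified;
}
```

Calling run_demo() from a main() and building with -pthread should finish without hanging. Moving the notify_stub() call inside the cache_lock critical section recreates the inverted ordering from the report.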
