(Patch and system details at bottom)

Hi all. I've root-caused and written a patch for the "children stuck on futex" problem described by both Sean Thorne in 2009 and Max Barry (with whom I work) in 2011.

The core of the problem is that modperl_tipool_putback_base only broadcasts that there are more interpreters available when there were no available interpreters prior to this putback. While this makes sense, it can create a problem.

Notation:
  A: Acquire an interpreter
  P: Putback an interpreter
  B: Broadcast a free interpreter (really a signal)
  W: Wait on condition tipool->available (for free interpreter)
  (x,y): x is the number of interpreters in use at this point;
         y is the number free.
  The number at the beginning of a line is the thread number.
  Each line occurs within a single critical section (on mutex tipool->tiplock).

Expected behavior: 4 threads, 2 free interpreters

  1: A (1,1)
  2: A (2,0)
  3: W
  4: W
  1: P (1,1) B
  3: A (2,0)
  2: P (1,1) B
  4: A (2,0)
  3: P (1,1) B
  4: P (0,2)   <-- No broadcast because there was an available
                   interpreter prior to this putback.

Broken behavior: 4 threads, 2 free interpreters

  1: A (1,1)
  2: A (2,0)
  3: W
  4: W
  1: P (1,1) B
  2: P (0,2)   <-- No broadcast because there was an available
                   interpreter prior to this putback.
  3: A (1,1)
  3: P (0,2)   <-- No broadcast because there was an available
                   interpreter prior to this putback.

(Broken) Thread 4 will never be signaled to pick up an interpreter.

This results in the thread getting stuck on futex because, sooner or later, Apache will tell this worker to die (due to MaxRequestsPerChild). The parent thread will then wait for the child threads to join, but one or more child threads will never wake up due to this problem.

My proposed fix is to always broadcast the availability of an interpreter, regardless of whether any were already free. With this change, every test I have found to throw at it passes, and the deadlock no longer occurs when reproducing the problem according to Max's instructions (http://pastebin.com/YDbmq84w).
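
To make the difference between the two policies concrete, here is a small deterministic model of the broken schedule, written in C. It is only a sketch under stated assumptions: pool_model, model_acquire, model_signal, model_putback, model_drain_woken and stuck_waiters are hypothetical names made up for illustration, not mod_perl functions. The model treats a signal as waking at most one waiter, ignores spurious wakeups, and runs each woken thread only after both initial putbacks have happened, as in the broken trace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the interpreter pool (not mod_perl code). */
typedef struct {
    int  max;            /* total interpreters in the pool */
    int  in_use;         /* interpreters currently checked out */
    int  waiting;        /* threads blocked on the condition */
    int  woken;          /* threads signaled but not yet run again */
    bool always_signal;  /* true = proposed fix, false = original condition */
} pool_model;

/* A thread asks for an interpreter: take one if free, otherwise wait. */
static void model_acquire(pool_model *p)
{
    if (p->in_use < p->max) {
        p->in_use++;
    } else {
        p->waiting++;
    }
}

/* A signal wakes at most one waiter; with no waiters it does nothing. */
static void model_signal(pool_model *p)
{
    if (p->waiting > 0) {
        p->waiting--;
        p->woken++;
    }
}

/* A thread returns its interpreter and signals according to the policy. */
static void model_putback(pool_model *p)
{
    assert(p->in_use > 0);
    p->in_use--;
    if (p->always_signal || p->in_use == p->max - 1) {
        model_signal(p);
    }
}

/* Let every woken thread acquire an interpreter and put it back. */
static void model_drain_woken(pool_model *p)
{
    while (p->woken > 0 && p->in_use < p->max) {
        p->woken--;
        p->in_use++;
        model_putback(p);
    }
}

/* Replay the broken schedule; return how many threads stay blocked. */
int stuck_waiters(bool always_signal)
{
    pool_model p = { .max = 2, .always_signal = always_signal };

    model_acquire(&p);     /* thread 1: A */
    model_acquire(&p);     /* thread 2: A */
    model_acquire(&p);     /* thread 3: W */
    model_acquire(&p);     /* thread 4: W */
    model_putback(&p);     /* thread 1: P */
    model_putback(&p);     /* thread 2: P */
    model_drain_woken(&p); /* woken threads run: A, then P */

    return p.waiting;
}
```

In this model, stuck_waiters(false) returns 1 (thread 4 is never woken, matching the broken trace), while stuck_waiters(true) returns 0. The extra signals issued when nobody is waiting are simply no-ops, which is why signaling unconditionally is safe here.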

My System Details:

uname -a:
  Linux modperl 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Apache:
  Custom build of 2.2.20 with ubuntu patches
  (http://packages.ubuntu.com/source/oneiric/apache2)

modperl:
  Custom build of 2.0.5 with ubuntu patches
  (http://packages.ubuntu.com/source/oneiric/libapache2-mod-perl2)

Build process:
  Standard ubuntu build process with following flags set:
  DEB_BUILD_OPTIONS="nostrip parallel=2 debug"
  CFLAGS="-g -O2 -DMP_TRACE=1 -DPERL_DESTRUCT_LEVEL=2 -DMP_DEBUG=1 -UMP_USE_GTOP -I/usr/include/libgtop-2.0/ -I/usr/include/glib-2.0/ -I/usr/lib/x86_64-linux-gnu/glib-2.0/include/"

Patch:

--- src/modules/perl/modperl_tipool.c.old 2012-03-03 19:43:57.112152297 -0800
+++ src/modules/perl/modperl_tipool.c 2012-03-03 04:28:31.000000000 -0800
@@ -328,9 +328,9 @@
     MP_TRACE_i(MP_FUNC, "0x%lx now available (%d in use, %d running)",
                (unsigned long)listp->data, tipool->in_use, tipool->size);
+    modperl_tipool_broadcast(tipool);
     if (tipool->in_use == (tipool->cfg->max - 1)) {
         /* hurry up, another thread may be blocking */
-        modperl_tipool_broadcast(tipool);
         modperl_tipool_unlock(tipool);
         return;
     }

Please let me know how best to get this checked in and out. As you might imagine, this futex problem has been causing us quite a few headaches :-)

Greg Rubin