Ok, thanks a lot for the for the patch attempt and subsequent in-depth
explanation. I know this thread is long-lived and a bit confusing,
mostly because I've seen two different sets of scenarios that yield
the same result, which is that unicorn does not restart.
The first scenario (and the one that started this thread) was that
unicorn actually receives the signal, spawns a new master, and that
process pegs the cpu and spins into eternity. The second scenario is
that the old master seems to completely ignore the USR2 signal the
first time, and when sent again, properly restarts. In both scenarios
the old master and workers continue to serve the old code.
I thought that setting the Unicorn::HttpServer::START_CTX[0] in my
unicorn config had solved the first scenario (pegged cpu), but I found
a new occurrence of it today, and I now have some new strace output
for this scenario (strace -v):
select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-22:54:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-22:54:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 853870540}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-22:59:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854102750}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-23:04:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854260655}) = 0
select(4, [3], NULL, NULL, {598, 630876}
With respect to the second scenario (ignoring signals), I'm going to
recommend we table that issue for now, as we're currently running on a
version of ubuntu (11.10) which has a known signal trapping bug with
ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
Granted, when I've observed this in the past with other libraries, the
process becomes completely unresponsive, whereas the behavior in
unicorn is more intermittent. Either way, we are in the process of
downgrading our servers to ubuntu 10.04, so let's not waste anymore
time trying to debug something that may well be occurring due to a
kernel bug. If we continue to see this after the downgrade to 10.04,
I'll report back and we can keep digging.
Again, my apologies for the confusion between the two scenarios, and
thanks for all your help.
- Alex
_______________________________________________
Unicorn mailing list - [email protected]
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying