Hi gang,

A few weeks ago, Greg Ames posted a problem that he saw at IBM
regarding SIGTERM, piped logs, and Solaris (he said 2.6, but turns out
to be seen on Solaris 2.7 here at eBuilt).  

For Greg's original message, see:

http:[EMAIL PROTECTED]%3e

The problem was initially reported against IBM Web Server, but I've now 
been able to reproduce this with Apache 1.3.21dev (current).

And, I think I've found the bug.  At least, I think I have.  This one
isn't easy to explain.  So, *please* bear with me.  If you have any
questions, email me and I'll try to explain further.  I'm not exactly
sure what is going on or what the proper long-term solution is.  The
short-term solution is to avoid using piped ErrorLog on Solaris with
more than a few piped TransferLogs.  There just seems to be some 
horrendously odd clashes between the two options causing some race
conditions.  Either one by themselves are fine, but together causes 
problems in the shutdown sequence leading to the parent httpd
not exiting properly (and spawning a few new rotatelogs children).

The problem appears to be with the initialization of the piped log 
for the ErrorLog directive *and* the shutdown process in
reclaim_child_processes.  When an ErrorLog is piped, it is opened by
error_log_child.  Notice that this is different than the
piped_log_spawn - my gut tells me if that we make error_log_child
similar to piped_log_spawn, this entire problem might go away.  But, 
I'm not familiar enough with the other_child code in 1.3 to tell for
certain.  This also looks like fairly old code (rse in '96 and
dgaudet in '97 based on cvs annotate).

The symptoms that I see on Solaris are thus:

1) The rotatelogs process for ErrorLog directive loses its parent
   after startup.  Hence, it's ppid is init.  I'm not sure how this
   plays in, but this doesn't look right.  I also can't recreate 2
   without having ErrorLog be a piped log.

2) On shutdown (via SIGTERM), a race condition occurs.  
   reclaim_child_processes kills all of the httpd children (fine),
   but the second half of the r_c_p call is a bit odd.  Basically,
   it calls piped_log_maint with OC_REASON_DEATH - this triggers
   piped_log_spawn to start up a new child since pl->program isn't 
   NULL.  This is completely and utterly wrong.  We're supposed to
   be shutting down, not starting up.  These new children never
   receive the SIGTERM - they will stick around until another
   SIGTERM occurs.  The old children have already quit due to the
   SIGTERM, but the piped_log_spawn starts up new rotatelogs 
   processes.

Take away the ErrorLog directive - no visible problems.  Don't have 
any TransferLog directives?  No visible problem.  I think that the 
ErrorLog code is broken and we're hitting some sort of signal race 
condition in reclaim_child_processes for the TransferLog.  How are 
these interacting?  I haven't a foggy clue.

Adding:

ErrorLog "| /foo/apache-1.3/bin/rotatelogs /foo/apache-1.3/logs/piped_error_log 86400"

causes the problem on Solaris.  You'll probably need a few TransferLog
directives to cause the inadvertant fork().  When you send SIGTERM to
the parent httpd process, new rotatelogs are created and hence the
parent will never quit (these new children didn't receive the SIGTERM).
Ooooops.

I've attached the following things to maybe help others understand:

1) Truss output of parent httpd process receiving SIGTERM.
   Notice the fork within the waitpid calls.  I later killed off
   the rotatelogs and then sent SIGTERM again (it terminated).
2) gdb stack trace of reclaim_child_processes
3) ps listing showing that there is a rotatelogs with 1 as its
   parent.

Am I right in my analysis?  Does this make any sense?  What is the 
proper fix to this?  Can anyone reproduce this on non-Solaris
machines?  -- justin
poll(0xFFBEED48, 0, 1000)                       = 0
time()                                          = 991198040
poll(0xFFBEEB98, 7, 0)                          = 7
waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEED48, 0, 1000)                       = 0
time()                                          = 991198041
waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEED48, 0, 1000)                       = 0
time()                                          = 991198042
waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
    Received signal #15, SIGTERM, in poll() [caught]
      siginfo: SIGTERM pid=12483 uid=7812
poll(0xFFBEED48, 0, 1000)                       Err#4 EINTR
setcontext(0xFFBEEA30)
time()                                          = 991198043
kill(-12454, SIGTERM)                           = 0
    Received signal #15, SIGTERM [caught]
      siginfo: SIGTERM pid=12454 uid=7812
setcontext(0xFFBEEB38)
poll(0xFFBEED28, 0, 17)                         = 0
waitid(P_PID, 12473, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12474, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12471, SIGTERM)                            = 0
waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12469, SIGTERM)                            = 0
waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12467, SIGTERM)                            = 0
waitid(P_PID, 12465, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
fork()                                          = 12485
waitid(P_PID, 12463, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
fork()                                          = 12487
waitid(P_PID, 12461, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
fork()                                          = 12489
waitid(P_PID, 12459, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
fork()                                          = 12491
poll(0xFFBEED28, 0, 66)                         = 0
waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12491, SIGTERM)                            = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12489, SIGTERM)                            = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12487, SIGTERM)                            = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12485, SIGTERM)                            = 0
waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEED28, 0, 263)                        = 0
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12491, SIGTERM)                            = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12489, SIGTERM)                            = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12487, SIGTERM)                            = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12485, SIGTERM)                            = 0
poll(0xFFBEED28, 0, 1049)       (sleeping...)
poll(0xFFBEED28, 0, 1049)                       = 0
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12491, SIGTERM)                            = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12489, SIGTERM)                            = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12487, SIGTERM)                            = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12485, SIGTERM)                            = 0
poll(0xFFBEED28, 0, 4195)       (sleeping...)
poll(0xFFBEED28, 0, 4195)                       = 0
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12491, SIGTERM)                            = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12489, SIGTERM)                            = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12487, SIGTERM)                            = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12485, SIGTERM)                            = 0
poll(0xFFBEED28, 0, 16778)      (sleeping...)
    Received signal #15, SIGTERM, in poll() [caught]
      siginfo: SIGTERM pid=12505 uid=7812
poll(0xFFBEED28, 0, 16778)                      Err#4 EINTR
setcontext(0xFFBEEA10)
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12491, SIGTERM)                            = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12489, SIGTERM)                            = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12487, SIGTERM)                            = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12485, SIGTERM)                            = 0
poll(0xFFBEED28, 0, 67109)      (sleeping...)
    Received signal #15, SIGTERM, in poll() [caught]
      siginfo: SIGTERM pid=27531 uid=7812
poll(0xFFBEED28, 0, 67109)                      Err#4 EINTR
setcontext(0xFFBEEA10)
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12491, SIGTERM)                            = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12489, SIGTERM)                            = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12487, SIGTERM)                            = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
kill(12485, SIGTERM)                            = 0
poll(0xFFBEED28, 0, 268436)     (sleeping...)
poll(0xFFBEED28, 0, 268436)                     = 0
waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
unlink("/home/jerenkrantz/apache-1.3/logs/httpd.pid") = 0
getpid()                                        = 12454 [1]
time()                                          = 991198337
write(5, " [ T u e   M a y   2 9  ".., 66)      Err#32 EPIPE
    Received signal #13, SIGPIPE [ignored]
close(5)                                        = 0
waitid(P_PID, 12457, 0xFFBEEAC0, WEXITED|WTRAPPED|WNOHANG) = 0
close(21)                                       = 0
close(20)                                       = 0
close(18)                                       = 0
close(19)                                       = 0
close(16)                                       = 0
close(17)                                       = 0
close(13)                                       = 0
close(14)                                       = 0
close(11)                                       = 0
close(12)                                       = 0
close(9)                                        = 0
close(10)                                       = 0
close(7)                                        = 0
close(8)                                        = 0
close(3)                                        = 0
close(6)                                        = 0
close(15)                                       = 0
llseek(0, 0, SEEK_CUR)                          = 0
_exit(0)
jerenkra 14112 14111  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14115 14113  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14118 14117  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14110 14108  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14111 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14103 14102  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14107 14106  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14106 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14108 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14104 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14114 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14127 13768  0 22:49:20 pts/5    0:00 grep rotate
jerenkra 14116 14114  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14117 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14113 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14082     1  0 22:48:47 pts/5    0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
jerenkra 14102 14097  0 22:48:50 ?        0:00 /bin/sh -c  
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
jerenkra 14105 14104  0 22:48:50 ?        0:00 
/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
#0  piped_log_spawn (pl=0xce420) at http_log.c:611
#1  0x4c190 in piped_log_maintenance (reason=0, data=0xce420, status=13) at 
http_log.c:658
#2  0x4e85c in reclaim_child_processes (terminate=1) at http_main.c:2495
#3  0x51bd8 in standalone_main (argc=1, argv=0xffbeefe4) at http_main.c:4779
#4  0x52484 in main (argc=1, argv=0xffbeefe4) at http_main.c:5010

Reply via email to