Anybody have any thoughts on this?  I haven't received any responses.
This bug is a big PITA on Solaris...

I think this patch might work (haven't tested it), but I'm not really 
sure.  Someone who knows the otherchild code would be able to tell 
for sure.  -- justin

Index: http_main.c
===================================================================
RCS file: /home/cvspublic/apache-1.3/src/main/http_main.c,v
retrieving revision 1.535
diff -u -r1.535 http_main.c
--- http_main.c 2001/04/12 17:49:26     1.535
+++ http_main.c 2001/06/01 23:50:35
@@ -2492,7 +2492,7 @@
            waitret = waitpid(ocr->pid, &status, WNOHANG);
            if (waitret == ocr->pid) {
                ocr->pid = -1;
-               (*ocr->maintenance) (OC_REASON_DEATH, ocr->data, (ap_wait_t)status);
+               (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)status);
            }
            else if (waitret == 0) {
                (*ocr->maintenance) (OC_REASON_RESTART, ocr->data, (ap_wait_t)-1);

On Wed, May 30, 2001 at 12:15:21AM -0700, Justin Erenkrantz wrote:
> Hi gang,
> 
> A few weeks ago, Greg Ames posted a problem that he saw at IBM
> regarding SIGTERM, piped logs, and Solaris (he said 2.6, but turns out
> to be seen on Solaris 2.7 here at eBuilt).  
> 
> For Greg's original message, see:
> 
> http:[EMAIL PROTECTED]%3e
> 
> The problem was initially reported against IBM Web Server, but I've now 
> been able to reproduce this with Apache 1.3.21dev (current).
> 
> And, I think I've found the bug.  At least, I think I have.  This one
> isn't easy to explain.  So, *please* bear with me.  If you have any
> questions, email me and I'll try to explain further.  I'm not exactly
> sure what is going on or what the proper long-term solution is.  The
> short-term solution is to avoid using piped ErrorLog on Solaris with
> more than a few piped TransferLogs.  There just seems to be some 
> horrendously odd clashes between the two options causing some race
> conditions.  Either one by themselves are fine, but together causes 
> problems in the shutdown sequence leading to the parent httpd
> not exiting properly (and spawning a few new rotatelogs children).
> 
> The problem appears to be with the initialization of the piped log 
> for the ErrorLog directive *and* the shutdown process in
> reclaim_child_processes.  When an ErrorLog is piped, it is opened by
> error_log_child.  Notice that this is different than the
> piped_log_spawn - my gut tells me if that we make error_log_child
> similar to piped_log_spawn, this entire problem might go away.  But, 
> I'm not familiar enough with the other_child code in 1.3 to tell for
> certain.  This also looks like fairly old code (rse in '96 and
> dgaudet in '97 based on cvs annotate).
> 
> The symptoms that I see on Solaris are thus:
> 
> 1) The rotatelogs process for ErrorLog directive loses its parent
>    after startup.  Hence, it's ppid is init.  I'm not sure how this
>    plays in, but this doesn't look right.  I also can't recreate 2
>    without having ErrorLog be a piped log.
> 
> 2) On shutdown (via SIGTERM), a race condition occurs.  
>    reclaim_child_processes kills all of the httpd children (fine),
>    but the second half of the r_c_p call is a bit odd.  Basically,
>    it calls piped_log_maint with OC_REASON_DEATH - this triggers
>    piped_log_spawn to start up a new child since pl->program isn't 
>    NULL.  This is completely and utterly wrong.  We're supposed to
>    be shutting down, not starting up.  These new children never
>    receive the SIGTERM - they will stick around until another
>    SIGTERM occurs.  The old children have already quit due to the
>    SIGTERM, but the piped_log_spawn starts up new rotatelogs 
>    processes.
> 
> Take away the ErrorLog directive - no visible problems.  Don't have 
> any TransferLog directives?  No visible problem.  I think that the 
> ErrorLog code is broken and we're hitting some sort of signal race 
> condition in reclaim_child_processes for the TransferLog.  How are 
> these interacting?  I haven't a foggy clue.
> 
> Adding:
> 
> ErrorLog "| /foo/apache-1.3/bin/rotatelogs /foo/apache-1.3/logs/piped_error_log 
>86400"
> 
> causes the problem on Solaris.  You'll probably need a few TransferLog
> directives to cause the inadvertant fork().  When you send SIGTERM to
> the parent httpd process, new rotatelogs are created and hence the
> parent will never quit (these new children didn't receive the SIGTERM).
> Ooooops.
> 
> I've attached the following things to maybe help others understand:
> 
> 1) Truss output of parent httpd process receiving SIGTERM.
>    Notice the fork within the waitpid calls.  I later killed off
>    the rotatelogs and then sent SIGTERM again (it terminated).
> 2) gdb stack trace of reclaim_child_processes
> 3) ps listing showing that there is a rotatelogs with 1 as its
>    parent.
> 
> Am I right in my analysis?  Does this make any sense?  What is the 
> proper fix to this?  Can anyone reproduce this on non-Solaris
> machines?  -- justin

> poll(0xFFBEED48, 0, 1000)                     = 0
> time()                                                = 991198040
> poll(0xFFBEEB98, 7, 0)                                = 7
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED48, 0, 1000)                     = 0
> time()                                                = 991198041
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED48, 0, 1000)                     = 0
> time()                                                = 991198042
> waitid(P_ALL, 0, 0xFFBEECF0, WEXITED|WTRAPPED|WNOHANG) = 0
>     Received signal #15, SIGTERM, in poll() [caught]
>       siginfo: SIGTERM pid=12483 uid=7812
> poll(0xFFBEED48, 0, 1000)                     Err#4 EINTR
> setcontext(0xFFBEEA30)
> time()                                                = 991198043
> kill(-12454, SIGTERM)                         = 0
>     Received signal #15, SIGTERM [caught]
>       siginfo: SIGTERM pid=12454 uid=7812
> setcontext(0xFFBEEB38)
> poll(0xFFBEED28, 0, 17)                               = 0
> waitid(P_PID, 12473, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12474, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12471, SIGTERM)                          = 0
> waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12469, SIGTERM)                          = 0
> waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12467, SIGTERM)                          = 0
> waitid(P_PID, 12465, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()                                                = 12485
> waitid(P_PID, 12463, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()                                                = 12487
> waitid(P_PID, 12461, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()                                                = 12489
> waitid(P_PID, 12459, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> fork()                                                = 12491
> poll(0xFFBEED28, 0, 66)                               = 0
> waitid(P_PID, 12475, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12476, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12477, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)                          = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)                          = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)                          = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)                          = 0
> waitid(P_PID, 12471, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12469, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12467, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> poll(0xFFBEED28, 0, 263)                      = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)                          = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)                          = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)                          = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)                          = 0
> poll(0xFFBEED28, 0, 1049)     (sleeping...)
> poll(0xFFBEED28, 0, 1049)                     = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)                          = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)                          = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)                          = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)                          = 0
> poll(0xFFBEED28, 0, 4195)     (sleeping...)
> poll(0xFFBEED28, 0, 4195)                     = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)                          = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)                          = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)                          = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)                          = 0
> poll(0xFFBEED28, 0, 16778)    (sleeping...)
>     Received signal #15, SIGTERM, in poll() [caught]
>       siginfo: SIGTERM pid=12505 uid=7812
> poll(0xFFBEED28, 0, 16778)                    Err#4 EINTR
> setcontext(0xFFBEEA10)
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)                          = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)                          = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)                          = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)                          = 0
> poll(0xFFBEED28, 0, 67109)    (sleeping...)
>     Received signal #15, SIGTERM, in poll() [caught]
>       siginfo: SIGTERM pid=27531 uid=7812
> poll(0xFFBEED28, 0, 67109)                    Err#4 EINTR
> setcontext(0xFFBEEA10)
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12491, SIGTERM)                          = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12489, SIGTERM)                          = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12487, SIGTERM)                          = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> kill(12485, SIGTERM)                          = 0
> poll(0xFFBEED28, 0, 268436)   (sleeping...)
> poll(0xFFBEED28, 0, 268436)                   = 0
> waitid(P_PID, 12491, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12489, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12487, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> waitid(P_PID, 12485, 0xFFBEECD0, WEXITED|WTRAPPED|WNOHANG) = 0
> unlink("/home/jerenkrantz/apache-1.3/logs/httpd.pid") = 0
> getpid()                                      = 12454 [1]
> time()                                                = 991198337
> write(5, " [ T u e   M a y   2 9  ".., 66)    Err#32 EPIPE
>     Received signal #13, SIGPIPE [ignored]
> close(5)                                      = 0
> waitid(P_PID, 12457, 0xFFBEEAC0, WEXITED|WTRAPPED|WNOHANG) = 0
> close(21)                                     = 0
> close(20)                                     = 0
> close(18)                                     = 0
> close(19)                                     = 0
> close(16)                                     = 0
> close(17)                                     = 0
> close(13)                                     = 0
> close(14)                                     = 0
> close(11)                                     = 0
> close(12)                                     = 0
> close(9)                                      = 0
> close(10)                                     = 0
> close(7)                                      = 0
> close(8)                                      = 0
> close(3)                                      = 0
> close(6)                                      = 0
> close(15)                                     = 0
> llseek(0, 0, SEEK_CUR)                                = 0
> _exit(0)

> jerenkra 14112 14111  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14115 14113  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14118 14117  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14110 14108  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14111 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14103 14102  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14107 14106  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14106 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14108 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14104 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14114 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14127 13768  0 22:49:20 pts/5    0:00 grep rotate
> jerenkra 14116 14114  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14117 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14113 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14082     1  0 22:48:47 pts/5    0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i
> jerenkra 14102 14097  0 22:48:50 ?        0:00 /bin/sh -c  
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apach
> jerenkra 14105 14104  0 22:48:50 ?        0:00 
>/home/jerenkrantz/apache-1.3/bin/rotatelogs /home/jerenkrantz/apache-1.3/logs/i

> #0  piped_log_spawn (pl=0xce420) at http_log.c:611
> #1  0x4c190 in piped_log_maintenance (reason=0, data=0xce420, status=13) at 
>http_log.c:658
> #2  0x4e85c in reclaim_child_processes (terminate=1) at http_main.c:2495
> #3  0x51bd8 in standalone_main (argc=1, argv=0xffbeefe4) at http_main.c:4779
> #4  0x52484 in main (argc=1, argv=0xffbeefe4) at http_main.c:5010

Reply via email to