In order to reproduce the logs I see on this weird behaviour, I'll attach the patch I made:
event_debug.diff
Description: Binary data
With this, I see in a hanging process: [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] core_output_filter: writing data to the network [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(599): AH: wakeup_listener: start [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=0 [Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(629): AH: wakeup_listener: pthread_kill [Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections [Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(635): AH: wakeup_listener: end [Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 4525448704] event.c(2940): graceful termination received, join workers [Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=0 [Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections Which says (to my eyes) that the pollset_wakeup happened, but the listener saw "listener_may_exit == 0". Is the volatile maybe not enough? > Am 07.07.2021 um 16:56 schrieb Stefan Eissing <stefan.eiss...@greenbytes.de>: > > > >> Am 07.07.2021 um 13:51 schrieb Yann Ylavic <ylavic....@gmail.com>: >> >> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing >> <stefan.eiss...@greenbytes.de> wrote: >>> >>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and >>> see: >>> >>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid >>> 4493635072] event.c(2921): graceful termination received, join workers >>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid >>> 123145435672576] event.c(1799): All workers are busy or dying, will close 1 >>> keep-alive connections >>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid >>> 123145435672576] event.c(1799): All workers are busy or dying, will close 0 >>> keep-alive connections >>> [Wed Jul 07 10:06:10.787777 2021] [core:warn] [pid 72813:tid 4493635072] >>> AH00045: child process 72886 still did not exit, sending a SIGTERM >>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] >>> AH00045: child process 72886 still did not exit, sending a SIGTERM >>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] >>> AH00045: child process 72886 still did not exit, sending a SIGTERM >>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] >>> AH00046: child process 72886 still did not exit, sending a SIGKILL >>> >>> So, I assume the keep-alive connection is the mod_proxy_http connection to >>> the sever itself. Since the join_workers() never returns, there seems to be >>> a thread not finishing. >> >> No, MPM event doesn't handle mod_proxy kept alive connections, those >> are client connections. >> >>> >>> On another run, I got a stacktrace of the child: >>> >>> Call graph: >>> 264 Thread_36766542 DispatchQueue_1: com.apple.main-thread (serial) >>> + 264 start (in libdyld.dylib) + 1 [0x7fff2032cf5d] >>> + 264 main (in httpd) + 2278 [0x104729b86] main.c:862 >>> + 264 ap_run_mpm (in httpd) + 75 [0x10473917b] mpm_common.c:100 >>> + 264 event_run (in mod_mpm_event.so) + 2994 [0x10502ae62] >>> event.c:3398 >>> + 264 make_child (in mod_mpm_event.so) + 436 [0x10502bbc4] >>> event.c:2997 >>> + 264 child_main (in mod_mpm_event.so) + 1734 [0x10502c2e6] >>> event.c:2924 >>> + 264 join_workers (in mod_mpm_event.so) + 386 >>> [0x10502cc72] event.c:2717 >>> + 264 apr_thread_join (in libapr-2.0.dylib) + 44 >>> [0x1048b347c] thread.c:256 >>> + 264 _pthread_join (in libsystem_pthread.dylib) + 362 >>> [0x7fff20312f60] >>> + 264 __ulock_wait (in libsystem_kernel.dylib) + 10 >>> [0x7fff202dd9ee] >> >> OK, that's where the child waits for worker threads. >> >>> 264 Thread_36766548 >>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] >>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 >>> [0x7fff203118fc] >>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] >>> thread.c:148 >>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] >>> h2_workers.c:260 >>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 >>> [0x7fff20311e49] >>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 >>> [0x7fff202decde] >>> 264 Thread_36766549 >>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] >>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 >>> [0x7fff203118fc] >>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] >>> thread.c:148 >>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] >>> h2_workers.c:260 >>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 >>> [0x7fff20311e49] >>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 >>> [0x7fff202decde] >>> 264 Thread_36766550 >>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] >>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 >>> [0x7fff203118fc] >>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] >>> thread.c:148 >>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] >>> h2_workers.c:260 >>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 >>> [0x7fff20311e49] >>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 >>> [0x7fff202decde] >>> 264 Thread_36766551 >>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] >>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 >>> [0x7fff203118fc] >>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] >>> thread.c:148 >>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] >>> h2_workers.c:260 >>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 >>> [0x7fff20311e49] >>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 >>> [0x7fff202decde] >> >> The above mod_http2 threads are not accounted for by the MPM I suppose >> (they should be stopped by the future child_exit hook, but still the >> MPM will only care about its own worker threads, i.e. those running >> the process_connection hook). > > Yes, the h2 worker thread currently get cleaned up on the pool pre cleanup. > Those are ok to see as that has not happened yet. > >> >>> 264 Thread_36766565 >>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] >>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 >>> [0x7fff203118fc] >>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] >>> thread.c:148 >>> + 264 worker_thread (in mod_mpm_event.so) + 380 [0x10502cf2c] >>> event.c:2340 >>> + 264 ap_queue_pop_something (in httpd) + 142 [0x10473cf3e] >>> mpm_fdqueue.c:447 >>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 >>> [0x7fff20311e49] >>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 >>> [0x7fff202decde] >> >> This worker didn't get the signal to stop from ap_queue_term().. >> >>> 264 Thread_36766579 >>> 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] >>> 264 _pthread_start (in libsystem_pthread.dylib) + 224 >>> [0x7fff203118fc] >>> 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] >>> thread.c:148 >>> 264 listener_thread (in mod_mpm_event.so) + 1129 [0x10502e069] >>> event.c:1940 >>> 264 impl_pollset_poll (in libapr-2.0.dylib) + 114 >>> [0x1048aec82] kqueue.c:272 >>> 264 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a] >> >> .. and that's because the listener didn't exit either, hmm. >> >> Your system sets listener_is_wakeable (AH02471), right? > > once at process start: > [Wed Jul 07 10:11:32.368594 2021] [mpm_event:debug] [pid 73128:tid > 123145437016064] event.c(2611): AH02471: start_threads: Using kqueue > (wakeable) > >> Does wakeup_listener() get called for the graceful restart that hangs? >> >> Cheers; >> Yann. > > Inserted some debug logging into event.c (and of course, the more I insert, > the less the likelihood of it happening): > Also, I had apr trunk in srclib and switched to branches/1.7.x - same > behaviour. > > End of log for a hanging process: > > [Wed Jul 07 14:35:12.300919 2021] [mpm_event:debug] [pid 80258:tid > 4439064064] event.c(599): AH: wakeup_listener: start > [Wed Jul 07 14:35:12.300938 2021] [mpm_event:debug] [pid 80258:tid > 4439064064] event.c(603): AH: wakeup_listener: apr_pollset_wakeup > [Wed Jul 07 14:35:12.300952 2021] [mpm_event:debug] [pid 80258:tid > 4439064064] event.c(610): AH: wakeup_listener: ap_queue_info_term > [Wed Jul 07 14:35:12.301082 2021] [mpm_event:trace1] [pid 80258:tid > 123145536962560] event.c(1809): All workers are busy or dying, will close 1 > keep-alive connections > [Wed Jul 07 14:35:12.301194 2021] [mpm_event:debug] [pid 80258:tid > 4439064064] event.c(629): AH: wakeup_listener: pthread_kill > [Wed Jul 07 14:35:12.301433 2021] [mpm_event:debug] [pid 80258:tid > 4439064064] event.c(635): AH: wakeup_listener: end > [Wed Jul 07 14:35:12.301444 2021] [mpm_event:trace1] [pid 80258:tid > 4439064064] event.c(2933): graceful termination received, join workers > [Wed Jul 07 14:35:16.501276 2021] [mpm_event:trace1] [pid 80258:tid > 123145536962560] event.c(1809): All workers are busy or dying, will close 0 > keep-alive connections > > macOS process analysis: > Call graph: > 2942 Thread_36817225 DispatchQueue_1: com.apple.main-thread (serial) > + 2942 start (in libdyld.dylib) + 1 [0x7fff2032cf5d] > + 2942 main (in httpd) + 2278 [0x107ec3b86] main.c:862 > + 2942 ap_run_mpm (in httpd) + 75 [0x107ed317b] mpm_common.c:100 > + 2942 event_run (in mod_mpm_event.so) + 2994 [0x1087c4b72] > event.c:3410 > + 2942 make_child (in mod_mpm_event.so) + 436 [0x1087c58d4] > event.c:3009 > + 2942 child_main (in mod_mpm_event.so) + 1734 [0x1087c5ff6] > event.c:2936 > + 2942 join_workers (in mod_mpm_event.so) + 197 > [0x1087c67e5] event.c:2729 > + 2942 apr_thread_join (in libapr-2.0.dylib) + 44 > [0x10804d47c] thread.c:256 > + 2942 _pthread_join (in libsystem_pthread.dylib) + 362 > [0x7fff20312f60] > + 2942 __ulock_wait (in libsystem_kernel.dylib) + 10 > [0x7fff202dd9ee] > 2942 Thread_36817227 > + 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] > + 2942 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > + 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > + 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] > h2_workers.c:260 > + 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 > [0x7fff20311e49] > + 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 > [0x7fff202decde] > 2942 Thread_36817228 > + 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] > + 2942 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > + 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > + 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] > h2_workers.c:260 > + 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 > [0x7fff20311e49] > + 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 > [0x7fff202decde] > 2942 Thread_36817229 > + 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] > + 2942 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > + 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > + 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] > h2_workers.c:260 > + 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 > [0x7fff20311e49] > + 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 > [0x7fff202decde] > 2942 Thread_36817230 > + 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] > + 2942 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > + 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > + 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] > h2_workers.c:260 > + 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 > [0x7fff20311e49] > + 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 > [0x7fff202decde] > 2942 Thread_36817254 > + 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] > + 2942 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > + 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > + 2942 worker_thread (in mod_mpm_event.so) + 300 [0x1087c6a4c] > event.c:2350 > + 2942 ap_queue_pop_something (in httpd) + 142 [0x107ed6f3e] > mpm_fdqueue.c:447 > + 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 > [0x7fff20311e49] > + 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 > [0x7fff202decde] > 2942 Thread_36817257 > 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443] > 2942 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > 2942 listener_thread (in mod_mpm_event.so) + 1148 [0x1087c7c2c] > event.c:1950 > 2942 impl_pollset_poll (in libapr-2.0.dylib) + 114 > [0x108048c82] kqueue.c:272 > 2942 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a] > > Total number in stack (recursive counted multiple, when >=5): > 6 _pthread_start (in libsystem_pthread.dylib) + 224 > [0x7fff203118fc] > 6 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] > thread.c:148 > 6 thread_start (in libsystem_pthread.dylib) + 15 > [0x7fff2030d443] > 5 __psynch_cvwait (in libsystem_kernel.dylib) + 0 > [0x7fff202decd4] > 5 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 > [0x7fff20311e49] > > Sort by top of stack, same collapsed (when >= 5): > __psynch_cvwait (in libsystem_kernel.dylib) 14710 > __ulock_wait (in libsystem_kernel.dylib) 2942 > kevent (in libsystem_kernel.dylib) 2942