In order to reproduce the logs I see on this weird behaviour, I'll attach the 
patch I made:

Attachment: event_debug.diff
Description: Binary data


With this, I see in a hanging process:

[Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] 
core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] 
core_output_filter: writing data to the network
[Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(599): AH: wakeup_listener: start
[Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 
123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=0
[Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 
123145461309440] event.c(1809): All workers are busy or dying, will close 1 
keep-alive connections
[Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(635): AH: wakeup_listener: end
[Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 4525448704] 
event.c(2940): graceful termination received, join workers
[Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 
123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: 
pollset returned listener_may_exit=0
[Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 
123145461309440] event.c(1809): All workers are busy or dying, will close 0 
keep-alive connections

Which says (to my eyes) that the pollset_wakeup happened, but the listener saw 
"listener_may_exit == 0". Is the volatile maybe not enough?


> Am 07.07.2021 um 16:56 schrieb Stefan Eissing <stefan.eiss...@greenbytes.de>:
> 
> 
> 
>> Am 07.07.2021 um 13:51 schrieb Yann Ylavic <ylavic....@gmail.com>:
>> 
>> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
>> <stefan.eiss...@greenbytes.de> wrote:
>>> 
>>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and 
>>> see:
>>> 
>>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 
>>> 4493635072] event.c(2921): graceful termination received, join workers
>>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 
>>> 123145435672576] event.c(1799): All workers are busy or dying, will close 1 
>>> keep-alive connections
>>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 
>>> 123145435672576] event.c(1799): All workers are busy or dying, will close 0 
>>> keep-alive connections
>>> [Wed Jul 07 10:06:10.787777 2021] [core:warn] [pid 72813:tid 4493635072] 
>>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] 
>>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] 
>>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] 
>>> AH00046: child process 72886 still did not exit, sending a SIGKILL
>>> 
>>> So, I assume the keep-alive connection is the mod_proxy_http connection to 
>>> the sever itself. Since the join_workers() never returns, there seems to be 
>>> a thread not finishing.
>> 
>> No, MPM event doesn't handle mod_proxy kept alive connections, those
>> are client connections.
>> 
>>> 
>>> On another run, I got a stacktrace of the child:
>>> 
>>> Call graph:
>>>   264 Thread_36766542   DispatchQueue_1: com.apple.main-thread  (serial)
>>>   + 264 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
>>>   +   264 main  (in httpd) + 2278  [0x104729b86]  main.c:862
>>>   +     264 ap_run_mpm  (in httpd) + 75  [0x10473917b]  mpm_common.c:100
>>>   +       264 event_run  (in mod_mpm_event.so) + 2994  [0x10502ae62]  
>>> event.c:3398
>>>   +         264 make_child  (in mod_mpm_event.so) + 436  [0x10502bbc4]  
>>> event.c:2997
>>>   +           264 child_main  (in mod_mpm_event.so) + 1734  [0x10502c2e6]  
>>> event.c:2924
>>>   +             264 join_workers  (in mod_mpm_event.so) + 386  
>>> [0x10502cc72]  event.c:2717
>>>   +               264 apr_thread_join  (in libapr-2.0.dylib) + 44  
>>> [0x1048b347c]  thread.c:256
>>>   +                 264 _pthread_join  (in libsystem_pthread.dylib) + 362  
>>> [0x7fff20312f60]
>>>   +                   264 __ulock_wait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202dd9ee]
>> 
>> OK, that's where the child waits for worker threads.
>> 
>>>   264 Thread_36766548
>>>   + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>   +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>   +     264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>>> thread.c:148
>>>   +       264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>>> h2_workers.c:260
>>>   +         264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>>> [0x7fff20311e49]
>>>   +           264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202decde]
>>>   264 Thread_36766549
>>>   + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>   +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>   +     264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>>> thread.c:148
>>>   +       264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>>> h2_workers.c:260
>>>   +         264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>>> [0x7fff20311e49]
>>>   +           264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202decde]
>>>   264 Thread_36766550
>>>   + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>   +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>   +     264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>>> thread.c:148
>>>   +       264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>>> h2_workers.c:260
>>>   +         264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>>> [0x7fff20311e49]
>>>   +           264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202decde]
>>>   264 Thread_36766551
>>>   + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>   +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>   +     264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>>> thread.c:148
>>>   +       264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>>> h2_workers.c:260
>>>   +         264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>>> [0x7fff20311e49]
>>>   +           264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202decde]
>> 
>> The above mod_http2 threads are not accounted for by the MPM I suppose
>> (they should be stopped by the future child_exit hook, but still the
>> MPM will only care about its own worker threads, i.e. those running
>> the process_connection hook).
> 
> Yes, the h2 worker thread currently get cleaned up on the pool pre cleanup. 
> Those are ok to see as that has not happened yet.
> 
>> 
>>>   264 Thread_36766565
>>>   + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>   +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>   +     264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>>> thread.c:148
>>>   +       264 worker_thread  (in mod_mpm_event.so) + 380  [0x10502cf2c]  
>>> event.c:2340
>>>   +         264 ap_queue_pop_something  (in httpd) + 142  [0x10473cf3e]  
>>> mpm_fdqueue.c:447
>>>   +           264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>>> [0x7fff20311e49]
>>>   +             264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202decde]
>> 
>> This worker didn't get the signal to stop from ap_queue_term()..
>> 
>>>   264 Thread_36766579
>>>     264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>       264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>         264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>>> thread.c:148
>>>           264 listener_thread  (in mod_mpm_event.so) + 1129  [0x10502e069]  
>>> event.c:1940
>>>             264 impl_pollset_poll  (in libapr-2.0.dylib) + 114  
>>> [0x1048aec82]  kqueue.c:272
>>>               264 kevent  (in libsystem_kernel.dylib) + 10  [0x7fff202e0c4a]
>> 
>> .. and that's because the listener didn't exit either, hmm.
>> 
>> Your system sets listener_is_wakeable (AH02471), right?
> 
> once at process start:
> [Wed Jul 07 10:11:32.368594 2021] [mpm_event:debug] [pid 73128:tid 
> 123145437016064] event.c(2611): AH02471: start_threads: Using kqueue 
> (wakeable)
> 
>> Does wakeup_listener() get called for the graceful restart that hangs?
>> 
>> Cheers;
>> Yann.
> 
> Inserted some debug logging into event.c (and of course, the more I insert, 
> the less the likelihood of it happening):
> Also, I had apr trunk in srclib and switched to branches/1.7.x - same 
> behaviour.
> 
> End of log for a hanging process:
> 
> [Wed Jul 07 14:35:12.300919 2021] [mpm_event:debug] [pid 80258:tid 
> 4439064064] event.c(599): AH: wakeup_listener: start
> [Wed Jul 07 14:35:12.300938 2021] [mpm_event:debug] [pid 80258:tid 
> 4439064064] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> [Wed Jul 07 14:35:12.300952 2021] [mpm_event:debug] [pid 80258:tid 
> 4439064064] event.c(610): AH: wakeup_listener: ap_queue_info_term
> [Wed Jul 07 14:35:12.301082 2021] [mpm_event:trace1] [pid 80258:tid 
> 123145536962560] event.c(1809): All workers are busy or dying, will close 1 
> keep-alive connections
> [Wed Jul 07 14:35:12.301194 2021] [mpm_event:debug] [pid 80258:tid 
> 4439064064] event.c(629): AH: wakeup_listener: pthread_kill
> [Wed Jul 07 14:35:12.301433 2021] [mpm_event:debug] [pid 80258:tid 
> 4439064064] event.c(635): AH: wakeup_listener: end
> [Wed Jul 07 14:35:12.301444 2021] [mpm_event:trace1] [pid 80258:tid 
> 4439064064] event.c(2933): graceful termination received, join workers
> [Wed Jul 07 14:35:16.501276 2021] [mpm_event:trace1] [pid 80258:tid 
> 123145536962560] event.c(1809): All workers are busy or dying, will close 0 
> keep-alive connections
> 
> macOS process analysis:
> Call graph:
>    2942 Thread_36817225   DispatchQueue_1: com.apple.main-thread  (serial)
>    + 2942 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
>    +   2942 main  (in httpd) + 2278  [0x107ec3b86]  main.c:862
>    +     2942 ap_run_mpm  (in httpd) + 75  [0x107ed317b]  mpm_common.c:100
>    +       2942 event_run  (in mod_mpm_event.so) + 2994  [0x1087c4b72]  
> event.c:3410
>    +         2942 make_child  (in mod_mpm_event.so) + 436  [0x1087c58d4]  
> event.c:3009
>    +           2942 child_main  (in mod_mpm_event.so) + 1734  [0x1087c5ff6]  
> event.c:2936
>    +             2942 join_workers  (in mod_mpm_event.so) + 197  
> [0x1087c67e5]  event.c:2729
>    +               2942 apr_thread_join  (in libapr-2.0.dylib) + 44  
> [0x10804d47c]  thread.c:256
>    +                 2942 _pthread_join  (in libsystem_pthread.dylib) + 362  
> [0x7fff20312f60]
>    +                   2942 __ulock_wait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202dd9ee]
>    2942 Thread_36817227
>    + 2942 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>    +   2942 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>    +     2942 dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>    +       2942 slot_run  (in mod_http2.so) + 189  [0x108733a8d]  
> h2_workers.c:260
>    +         2942 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
>    +           2942 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
>    2942 Thread_36817228
>    + 2942 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>    +   2942 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>    +     2942 dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>    +       2942 slot_run  (in mod_http2.so) + 189  [0x108733a8d]  
> h2_workers.c:260
>    +         2942 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
>    +           2942 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
>    2942 Thread_36817229
>    + 2942 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>    +   2942 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>    +     2942 dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>    +       2942 slot_run  (in mod_http2.so) + 189  [0x108733a8d]  
> h2_workers.c:260
>    +         2942 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
>    +           2942 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
>    2942 Thread_36817230
>    + 2942 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>    +   2942 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>    +     2942 dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>    +       2942 slot_run  (in mod_http2.so) + 189  [0x108733a8d]  
> h2_workers.c:260
>    +         2942 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
>    +           2942 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
>    2942 Thread_36817254
>    + 2942 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>    +   2942 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>    +     2942 dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>    +       2942 worker_thread  (in mod_mpm_event.so) + 300  [0x1087c6a4c]  
> event.c:2350
>    +         2942 ap_queue_pop_something  (in httpd) + 142  [0x107ed6f3e]  
> mpm_fdqueue.c:447
>    +           2942 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
>    +             2942 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
>    2942 Thread_36817257
>      2942 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>        2942 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>          2942 dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>            2942 listener_thread  (in mod_mpm_event.so) + 1148  [0x1087c7c2c]  
> event.c:1950
>              2942 impl_pollset_poll  (in libapr-2.0.dylib) + 114  
> [0x108048c82]  kqueue.c:272
>                2942 kevent  (in libsystem_kernel.dylib) + 10  [0x7fff202e0c4a]
> 
> Total number in stack (recursive counted multiple, when >=5):
>        6       _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
>        6       dummy_worker  (in libapr-2.0.dylib) + 30  [0x10804d3ee]  
> thread.c:148
>        6       thread_start  (in libsystem_pthread.dylib) + 15  
> [0x7fff2030d443]
>        5       __psynch_cvwait  (in libsystem_kernel.dylib) + 0  
> [0x7fff202decd4]
>        5       _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
> 
> Sort by top of stack, same collapsed (when >= 5):
>        __psynch_cvwait  (in libsystem_kernel.dylib)        14710
>        __ulock_wait  (in libsystem_kernel.dylib)        2942
>        kevent  (in libsystem_kernel.dylib)        2942

Reply via email to