On 20.02.2013 01:05, Rainer Jung wrote:
> Here's what I see concerning the graceful restart problem on Solaris.
> Setup using the prefork MPM with two http listeners. Accept mutex is
> pthread.
> 
> Short version: child processes that do not manage to acquire the accept
> mutex during graceful restart, before the next generation of child
> processes gets started, remain stuck waiting for the accept mutex.
> 
> Long version of what happens when a graceful restart is issued:
> 
> 1) parent calls ap_mpm_pod_killpg for all (here: 6) children
>    This quickly produces 6 "OPTIONS *" requests.
> 2) First child accepts and processes one "OPTIONS *" request
>    and then exits
> 3) Second child gets the accept mutex and calls accept
> 4) Parent calls ap_mpm_safe_kill with AP_SIG_GRACEFUL for all
>    children pids. All children execute signal handler,
>    close the listening sockets and set die_now=1
> 5) Second child accepts and processes one
>    "OPTIONS *" and exits
> 6) Third child gets the accept mutex lock, sees die_now=1
>    unlocks the lock and exits
> 7) Three more children still wait for the accept mutex
> 8) parent starts next generation child processes
> 9) These new children wait for the accept mutex.
>    The mutex is now always acquired by one of the new children.
>    First thing they do is work on the remaining 4 "OPTIONS *"
>    requests. The remaining old children never get the accept mutex
>    and keep hanging.
> 
> What is strange to me: why isn't the GRACEFUL signal effective in
> interrupting the waiting for the accept mutex? Is that expected?

Aha: POSIX states: "If a signal is delivered to a thread waiting for a
mutex, upon return from the signal handler the thread shall resume
waiting for the mutex as if it was not interrupted."

So it is expected that the signal does not interrupt the wait for the
accept mutex. But then I don't understand how the above procedure can
reliably end the child processes.
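
To check the POSIX statement in isolation, here is a minimal sketch. It
is not httpd code and rests on several assumptions: it uses two threads
in one process instead of separate child processes, a plain
process-local pthread mutex as a stand-in for the accept mutex, and
SIGUSR1 as a stand-in for AP_SIG_GRACEFUL. The names accept_mutex,
waiter, graceful_handler and signal_leaves_waiter_blocked are made up
for this example. The handler sets die_now, as in step 4, but the
waiter should stay blocked in pthread_mutex_lock() until the holder
unlocks.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the accept mutex (process-local here). */
static pthread_mutex_t accept_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int die_now;   /* set by the signal handler */
static atomic_int got_lock;  /* set once the waiter holds the mutex */

static void graceful_handler(int sig)
{
    (void)sig;
    atomic_store(&die_now, 1);
}

static void *waiter(void *arg)
{
    (void)arg;
    /* Blocks here: the main thread holds the mutex. */
    pthread_mutex_lock(&accept_mutex);
    atomic_store(&got_lock, 1);
    pthread_mutex_unlock(&accept_mutex);
    return NULL;
}

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns 1 if the handler ran while the waiter was still blocked on
 * the mutex, 0 if the waiter got past the lock, -1 on setup failure. */
int signal_leaves_waiter_blocked(void)
{
    struct sigaction sa;
    pthread_t tid;
    int still_blocked;

    atomic_store(&die_now, 0);
    atomic_store(&got_lock, 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = graceful_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    pthread_mutex_lock(&accept_mutex);
    if (pthread_create(&tid, NULL, waiter, NULL) != 0) {
        pthread_mutex_unlock(&accept_mutex);
        return -1;
    }

    sleep_ms(200);               /* let the waiter block on the mutex */
    pthread_kill(tid, SIGUSR1);  /* deliver the signal to the waiter */
    sleep_ms(200);               /* let the handler run */

    still_blocked = atomic_load(&die_now) == 1 &&
                    atomic_load(&got_lock) == 0;

    pthread_mutex_unlock(&accept_mutex);  /* only now can the waiter go on */
    pthread_join(tid, NULL);
    return still_blocked;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("waiter still blocked after signal: %d\n",
           signal_leaves_waiter_blocked());
    return 0;
}
#endif
```

Built with -pthread -DDEMO_MAIN this runs standalone. The sleeps are a
crude way to order the steps, so this is only an illustration and not a
proper test, but it matches what I see: the handler runs, and the wait
for the mutex continues as if nothing had happened.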

> If I add a short delay between the "OPTIONS *" requests and the
> ap_mpm_safe_kill, all old children process one of those requests and then
> set die_now to 1 because they see that there's a new generation. Then
> they actually exit.

Rainer
