On 4/14/22 3:43 PM, Yann Ylavic wrote:
> On Thu, Apr 14, 2022 at 1:44 PM Ruediger Pluem <rpl...@apache.org> wrote:
>>
>> On 4/14/22 1:16 PM, Yann Ylavic wrote:
>>> On Thu, Apr 14, 2022 at 12:09 PM Ruediger Pluem <rpl...@apache.org> wrote:
>>>>
>>>> On 4/14/22 11:52 AM, Yann Ylavic wrote:
>>>>>
>>>>> Any signal that killed the process would mean something unexpected
>>>>> happened since we clean_child_exit() otherwise?
>>>>
>>>> Hm. You mean that the case for
>>>>
>>>> case SIGTERM:
>>>> case SIGHUP:
>>>> case AP_SIG_GRACEFUL:
>>>>
>>>> in ap_process_child_status
>>>> never triggers as the child catches these and does a clean_child_exit()?
>>>
>>> Yes
>>>
>>>>
>>>> If yes, the above seems to be a simple solution, but it would also capture
>>>> SIGKILL which might
>>>> have been issued by our parent. Does this matter?
>>>
>>> I think our SIGKILLs (for ungraceful restart) are "captured" by
>>> ap_reclaim_child_processes(), that is outside server_main_loop() so it
>>> should be OK..
>>
>> Good point. Then I am fine with your proposed patch.
>
> Thinking more about it, I came to something like this:
>
> Index: server/mpm/event/event.c
> ===================================================================
> --- server/mpm/event/event.c (revision 1899818)
> +++ server/mpm/event/event.c (working copy)
> @@ -3382,6 +3382,7 @@ static void server_main_loop(int remaining_childre
> {
> int num_buckets = retained->mpm->num_buckets;
> int max_daemon_used = 0;
> + int successive_children_signaled = 0;
> int child_slot;
> apr_exit_why_e exitwhy;
> int status, processed_status;
> @@ -3460,11 +3461,26 @@ static void server_main_loop(int remaining_childre
> /* Don't perform idle maintenance when a child dies,
> * only do it when there's a timeout. Remember only a
> * finite number of children can die, and it's pretty
> - * pathological for a lot to die suddenly.
> + * pathological for a lot to die suddenly. If that happens
> + * anyway, protect against pathological segfault flood by not
> + * restarting more than 3 children if no timeout happened in
> + * between, otherwise we let the usual maintenance go.
> */
> - continue;
> + if (child_slot < 0 || !APR_PROC_CHECK_SIGNALED(exitwhy)) {
> + continue;
> + }
> + if (++successive_children_signaled >= 3) {
> + apr_sleep(apr_time_from_sec(1));
Why can't we just continue here? We either have something to reap in the next
iteration which is fine or not where we would sleep
1 second anyway..
> + }
> + else {
> + remaining_children_to_start++;
> + }
> }
> - else if (remaining_children_to_start) {
> + else {
> + successive_children_signaled = 0;
> + }
> +
> + if (remaining_children_to_start) {
> /* we hit a 1 second timeout in which none of the previous
> * generation of children needed to be reaped... so assume
> * they're all done, and pick up the slack if any is left.
> --
>
> I have not tested it yet but possibly without this if many children
> segfault successively we might enter a busy fork() loop.
> The above would restart killed children immediately (using the
> remaining_children_to_start mechanism) only if there are less than 3
> successively, otherwise sleep 1s and
> perform_idle_server_maintenance().
>
> WDYT?
Having a stopper for the fork loop sounds sensible. We may should log something
if we hit it to ease analysis for situations where
the childs quickly segfault.
Regards
RĂ¼diger