Re: Faster start children after fatal signals?

2022-04-14 Thread Yann Ylavic
On Thu, Apr 14, 2022 at 4:00 PM Ruediger Pluem  wrote:
>
> On 4/14/22 3:43 PM, Yann Ylavic wrote:
> >
> > I have not tested it yet but possibly without this if many children
> > segfault successively we might enter a busy fork() loop.

Tested and it seems to work..

> > The above would restart killed children immediately (using the
> > remaining_children_to_start mechanism) only if there are less than 3
> > successively, otherwise sleep 1s and
> > perform_idle_server_maintenance().
> >
> > WDYT?
>
> Having a stopper for the fork loop sounds sensible. We may should log 
> something if we hit it to ease analysis for situations where
> the childs quickly segfault.

r1899858


Thanks;
Yann.


Re: Faster start children after fatal signals?

2022-04-14 Thread Ruediger Pluem



On 4/14/22 3:43 PM, Yann Ylavic wrote:
> On Thu, Apr 14, 2022 at 1:44 PM Ruediger Pluem  wrote:
>>
>> On 4/14/22 1:16 PM, Yann Ylavic wrote:
>>> On Thu, Apr 14, 2022 at 12:09 PM Ruediger Pluem  wrote:

 On 4/14/22 11:52 AM, Yann Ylavic wrote:
>
> Any signal that killed the process would mean something unexpected
> happened since we clean_child_exit() otherwise?

 Hm. You mean that the case for

 case SIGTERM:
 case SIGHUP:
 case AP_SIG_GRACEFUL:

 in ap_process_child_status
 never triggers as the child catches these and does a clean_child_exit()?
>>>
>>> Yes
>>>

 If yes, the above seems to be a simple solution, but it would also capture 
 SIGKILL which might
 have been issued by our parent. Does this matter?
>>>
>>> I think our SIGKILLs (for ungraceful restart) are "captured" by
>>> ap_reclaim_child_processes(), that is outside server_main_loop() so it
>>> should be OK..
>>
>> Good point. Then I am fine with your proposed patch.
> 
> Thinking more about it, I came to something like this:
> 
> Index: server/mpm/event/event.c
> ===
> --- server/mpm/event/event.c(revision 1899818)
> +++ server/mpm/event/event.c(working copy)
> @@ -3382,6 +3382,7 @@ static void server_main_loop(int remaining_childre
>  {
>  int num_buckets = retained->mpm->num_buckets;
>  int max_daemon_used = 0;
> +int successive_children_signaled = 0;
>  int child_slot;
>  apr_exit_why_e exitwhy;
>  int status, processed_status;
> @@ -3460,11 +3461,26 @@ static void server_main_loop(int remaining_childre
>  /* Don't perform idle maintenance when a child dies,
>   * only do it when there's a timeout.  Remember only a
>   * finite number of children can die, and it's pretty
> - * pathological for a lot to die suddenly.
> + * pathological for a lot to die suddenly.  If that happens
> + * anyway, protect against pathological segfault flood by not
> + * restarting more than 3 children if no timeout happened in
> + * between, otherwise we let the usual maintenance go.
>   */
> -continue;
> +if (child_slot < 0 || !APR_PROC_CHECK_SIGNALED(exitwhy)) {
> +continue;
> +}
> +if (++successive_children_signaled >= 3) {
> +apr_sleep(apr_time_from_sec(1));

Why can't we just continue here? We either have something to reap in the next 
iteration which is fine or not where we would sleep
1 second anyway..

> +}
> +else {
> +remaining_children_to_start++;
> +}
>  }
> -else if (remaining_children_to_start) {
> +else {
> +successive_children_signaled = 0;
> +}
> +
> +if (remaining_children_to_start) {
>  /* we hit a 1 second timeout in which none of the previous
>   * generation of children needed to be reaped... so assume
>   * they're all done, and pick up the slack if any is left.
> --
> 
> I have not tested it yet but possibly without this if many children
> segfault successively we might enter a busy fork() loop.
> The above would restart killed children immediately (using the
> remaining_children_to_start mechanism) only if there are less than 3
> successively, otherwise sleep 1s and
> perform_idle_server_maintenance().
> 
> WDYT?

Having a stopper for the fork loop sounds sensible. We may should log something 
if we hit it to ease analysis for situations where
the childs quickly segfault.

Regards

Rüdiger



Re: Faster start children after fatal signals?

2022-04-14 Thread Yann Ylavic
On Thu, Apr 14, 2022 at 1:44 PM Ruediger Pluem  wrote:
>
> On 4/14/22 1:16 PM, Yann Ylavic wrote:
> > On Thu, Apr 14, 2022 at 12:09 PM Ruediger Pluem  wrote:
> >>
> >> On 4/14/22 11:52 AM, Yann Ylavic wrote:
> >>>
> >>> Any signal that killed the process would mean something unexpected
> >>> happened since we clean_child_exit() otherwise?
> >>
> >> Hm. You mean that the case for
> >>
> >> case SIGTERM:
> >> case SIGHUP:
> >> case AP_SIG_GRACEFUL:
> >>
> >> in ap_process_child_status
> >> never triggers as the child catches these and does a clean_child_exit()?
> >
> > Yes
> >
> >>
> >> If yes, the above seems to be a simple solution, but it would also capture 
> >> SIGKILL which might
> >> have been issued by our parent. Does this matter?
> >
> > I think our SIGKILLs (for ungraceful restart) are "captured" by
> > ap_reclaim_child_processes(), that is outside server_main_loop() so it
> > should be OK..
>
> Good point. Then I am fine with your proposed patch.

Thinking more about it, I came to something like this:

Index: server/mpm/event/event.c
===
--- server/mpm/event/event.c(revision 1899818)
+++ server/mpm/event/event.c(working copy)
@@ -3382,6 +3382,7 @@ static void server_main_loop(int remaining_childre
 {
 int num_buckets = retained->mpm->num_buckets;
 int max_daemon_used = 0;
+int successive_children_signaled = 0;
 int child_slot;
 apr_exit_why_e exitwhy;
 int status, processed_status;
@@ -3460,11 +3461,26 @@ static void server_main_loop(int remaining_childre
 /* Don't perform idle maintenance when a child dies,
  * only do it when there's a timeout.  Remember only a
  * finite number of children can die, and it's pretty
- * pathological for a lot to die suddenly.
+ * pathological for a lot to die suddenly.  If that happens
+ * anyway, protect against pathological segfault flood by not
+ * restarting more than 3 children if no timeout happened in
+ * between, otherwise we let the usual maintenance go.
  */
-continue;
+if (child_slot < 0 || !APR_PROC_CHECK_SIGNALED(exitwhy)) {
+continue;
+}
+if (++successive_children_signaled >= 3) {
+apr_sleep(apr_time_from_sec(1));
+}
+else {
+remaining_children_to_start++;
+}
 }
-else if (remaining_children_to_start) {
+else {
+successive_children_signaled = 0;
+}
+
+if (remaining_children_to_start) {
 /* we hit a 1 second timeout in which none of the previous
  * generation of children needed to be reaped... so assume
  * they're all done, and pick up the slack if any is left.
--

I have not tested it yet but possibly without this if many children
segfault successively we might enter a busy fork() loop.
The above would restart killed children immediately (using the
remaining_children_to_start mechanism) only if there are less than 3
successively, otherwise sleep 1s and
perform_idle_server_maintenance().

WDYT?

> >
> >>
> >> Do we need something for processes that die due to MaxConnectionsPerChild 
> >> or do we expect them to die at a slow path where the
> >> current approach is fine?
> >
> > Hm, good question. Restarting them if not necessary from a maintenance
> > metrics perspective is no better/worse than waiting for a maintenance
> > cycle, so I'd say let them be maintained as we do now..
> > Ideally (maybe) we'd pass a timeout to ap_wait_or_timeout() to make
> > sure that maintenance cycles really happen every 1s, regardless of
> > exited processes in the meantime (i.e. handle a "remaining" timeout
> > instead of the hardcoded 1s which currently may result in
> > 0.9s+waitpid()+0.9s+waitpid()+timeout thus here ~2s but possibly more
> > between two maintenance cycles).
>
> Hm, but ap_wait_or_timeout only waits if there is nothing to reap, which 
> would mean we would wait at max 1 s (+ processing time
> for multiple ap_wait_or_timeout calls) until we perform the idle server 
> maintenance, or do I miss your point?

Yeah, I should have looked at ap_wait_or_timeout() better, it's
actually an ap_wait_or_sleep()..


Regards;
Yann.


Re: Faster start children after fatal signals?

2022-04-14 Thread Ruediger Pluem



On 4/14/22 1:16 PM, Yann Ylavic wrote:
> On Thu, Apr 14, 2022 at 12:09 PM Ruediger Pluem  wrote:
>>
>> On 4/14/22 11:52 AM, Yann Ylavic wrote:
>>>
>>> Any signal that killed the process would mean something unexpected
>>> happened since we clean_child_exit() otherwise?
>>
>> Hm. You mean that the case for
>>
>> case SIGTERM:
>> case SIGHUP:
>> case AP_SIG_GRACEFUL:
>>
>> in ap_process_child_status
>> never triggers as the child catches these and does a clean_child_exit()?
> 
> Yes
> 
>>
>> If yes, the above seems to be a simple solution, but it would also capture 
>> SIGKILL which might
>> have been issued by our parent. Does this matter?
> 
> I think our SIGKILLs (for ungraceful restart) are "captured" by
> ap_reclaim_child_processes(), that is outside server_main_loop() so it
> should be OK..

Good point. Then I am fine with your proposed patch.
> 
>>
>> Do we need something for processes that die due to MaxConnectionsPerChild or 
>> do we expect them to die at a slow path where the
>> current approach is fine?
> 
> Hm, good question. Restarting them if not necessary from a maintenance
> metrics perspective is no better/worse than waiting for a maintenance
> cycle, so I'd say let them be maintained as we do now..
> Ideally (maybe) we'd pass a timeout to ap_wait_or_timeout() to make
> sure that maintenance cycles really happen every 1s, regardless of
> exited processes in the meantime (i.e. handle a "remaining" timeout
> instead of the hardcoded 1s which currently may result in
> 0.9s+waitpid()+0.9s+waitpid()+timeout thus here ~2s but possibly more
> between two maintenance cycles).

Hm, but ap_wait_or_timeout only waits if there is nothing to reap, which would 
mean we would wait at max 1 s (+ processing time
for multiple ap_wait_or_timeout calls) until we perform the idle server 
maintenance, or do I miss your point?


Regards

Rüdiger




Re: Faster start children after fatal signals?

2022-04-14 Thread Yann Ylavic
On Thu, Apr 14, 2022 at 12:09 PM Ruediger Pluem  wrote:
>
> On 4/14/22 11:52 AM, Yann Ylavic wrote:
> >
> > Any signal that killed the process would mean something unexpected
> > happened since we clean_child_exit() otherwise?
>
> Hm. You mean that the case for
>
> case SIGTERM:
> case SIGHUP:
> case AP_SIG_GRACEFUL:
>
> in ap_process_child_status
> never triggers as the child catches these and does a clean_child_exit()?

Yes

>
> If yes, the above seems to be a simple solution, but it would also capture 
> SIGKILL which might
> have been issued by our parent. Does this matter?

I think our SIGKILLs (for ungraceful restart) are "captured" by
ap_reclaim_child_processes(), that is outside server_main_loop() so it
should be OK..

>
> Do we need something for processes that die due to MaxConnectionsPerChild or 
> do we expect them to die at a slow path where the
> current approach is fine?

Hm, good question. Restarting them if not necessary from a maintenance
metrics perspective is no better/worse than waiting for a maintenance
cycle, so I'd say let them be maintained as we do now..
Ideally (maybe) we'd pass a timeout to ap_wait_or_timeout() to make
sure that maintenance cycles really happen every 1s, regardless of
exited processes in the meantime (i.e. handle a "remaining" timeout
instead of the hardcoded 1s which currently may result in
0.9s+waitpid()+0.9s+waitpid()+timeout thus here ~2s but possibly more
between two maintenance cycles).

>
> Regards
>
> Rüdiger


Re: Faster start children after fatal signals?

2022-04-14 Thread Ruediger Pluem



On 4/14/22 11:52 AM, Yann Ylavic wrote:
> On Thu, Apr 14, 2022 at 11:21 AM Ruediger Pluem  wrote:
>>
>> On 4/13/22 5:41 PM, Yann Ylavic wrote:
>>> On Wed, Apr 13, 2022 at 4:30 PM Ruediger Pluem  wrote:

 While looking at PR65769 I stumbled across the below in event.c (same in 
 worker.c)

 3460/* Don't perform idle maintenance when a child dies,
 3461 * only do it when there's a timeout.  Remember only a
 3462 * finite number of children can die, and it's pretty
 3463 * pathological for a lot to die suddenly.
 3464 */
 3465continue;

 In case several processes or even all die by a segfault we would wait 
 until we have processed all that processes plus one second
 until we start new processes. Shouldn't we perform an 
 perform_idle_server_maintenance in case the process died because of a fatal
 signal?
>>>
>>> +1
>>>
>>
>> In order to solve this I would like to add another APEXIT_ constant (e.g. 
>> APEXIT_FATAL_SIGNAL).
> 
> Maybe something like:
> 
> Index: server/mpm/event/event.c
> ===
> --- server/mpm/event/event.c(revision 1899818)
> +++ server/mpm/event/event.c(working copy)
> @@ -3462,7 +3462,9 @@ static void server_main_loop(int remaining_childre
>   * finite number of children can die, and it's pretty
>   * pathological for a lot to die suddenly.
>   */
> -continue;
> +if (!APR_PROC_CHECK_SIGNALED(exitwhy)) {
> +continue;
> +}
>  }
>  else if (remaining_children_to_start) {
>  /* we hit a 1 second timeout in which none of the previous
> --
> 
> could do?
> Any signal that killed the process would mean something unexpected
> happened since we clean_child_exit() otherwise?

Hm. You mean that the case for

case SIGTERM:
case SIGHUP:
case AP_SIG_GRACEFUL:

in ap_process_child_status
never triggers as the child catches these and does a clean_child_exit()?

If yes, the above seems to be a simple solution, but it would also capture 
SIGKILL which might
have been issued by our parent. Does this matter?

Do we need something for processes that die due to MaxConnectionsPerChild or do 
we expect them to die at a slow path where the
current approach is fine?

Regards

Rüdiger


Re: Faster start children after fatal signals?

2022-04-14 Thread Yann Ylavic
On Thu, Apr 14, 2022 at 11:21 AM Ruediger Pluem  wrote:
>
> On 4/13/22 5:41 PM, Yann Ylavic wrote:
> > On Wed, Apr 13, 2022 at 4:30 PM Ruediger Pluem  wrote:
> >>
> >> While looking at PR65769 I stumbled across the below in event.c (same in 
> >> worker.c)
> >>
> >> 3460/* Don't perform idle maintenance when a child dies,
> >> 3461 * only do it when there's a timeout.  Remember only a
> >> 3462 * finite number of children can die, and it's pretty
> >> 3463 * pathological for a lot to die suddenly.
> >> 3464 */
> >> 3465continue;
> >>
> >> In case several processes or even all die by a segfault we would wait 
> >> until we have processed all that processes plus one second
> >> until we start new processes. Shouldn't we perform an 
> >> perform_idle_server_maintenance in case the process died because of a fatal
> >> signal?
> >
> > +1
> >
>
> In order to solve this I would like to add another APEXIT_ constant (e.g. 
> APEXIT_FATAL_SIGNAL).

Maybe something like:

Index: server/mpm/event/event.c
===
--- server/mpm/event/event.c(revision 1899818)
+++ server/mpm/event/event.c(working copy)
@@ -3462,7 +3462,9 @@ static void server_main_loop(int remaining_childre
  * finite number of children can die, and it's pretty
  * pathological for a lot to die suddenly.
  */
-continue;
+if (!APR_PROC_CHECK_SIGNALED(exitwhy)) {
+continue;
+}
 }
 else if (remaining_children_to_start) {
 /* we hit a 1 second timeout in which none of the previous
--

could do?
Any signal that killed the process would mean something unexpected
happened since we clean_child_exit() otherwise?


> Looking at httpd.h
> the values of the existing constants seem a little bit strange:
>
> /** a normal exit */
> #define APEXIT_OK   0x0
> /** A fatal error arising during the server's init sequence */
> #define APEXIT_INIT 0x2
> /**  The child died during its init sequence */
> #define APEXIT_CHILDINIT0x3
> /**
>  *   The child exited due to a resource shortage.
>  *   The parent should limit the rate of forking until
>  *   the situation is resolved.
>  */
> #define APEXIT_CHILDSICK0x7
> /**
>  * A fatal error, resulting in the whole server aborting.
>  * If a child exits with this error, the parent process
>  * considers this a server-wide fatal error and aborts.
>  */
> #define APEXIT_CHILDFATAL   0xf
>
>
> I haven't found any hint during my investigation why they are as they are on 
> not just 1, 2, 3, 
> Any ideas?

Nope :/


Regards;
Yann.


Re: Faster start children after fatal signals?

2022-04-14 Thread Ruediger Pluem



On 4/13/22 5:41 PM, Yann Ylavic wrote:
> On Wed, Apr 13, 2022 at 4:30 PM Ruediger Pluem  wrote:
>>
>> While looking at PR65769 I stumbled across the below in event.c (same in 
>> worker.c)
>>
>> 3460/* Don't perform idle maintenance when a child dies,
>> 3461 * only do it when there's a timeout.  Remember only a
>> 3462 * finite number of children can die, and it's pretty
>> 3463 * pathological for a lot to die suddenly.
>> 3464 */
>> 3465continue;
>>
>> In case several processes or even all die by a segfault we would wait until 
>> we have processed all that processes plus one second
>> until we start new processes. Shouldn't we perform an 
>> perform_idle_server_maintenance in case the process died because of a fatal
>> signal?
> 
> +1
> 

In order to solve this I would like to add another APEXIT_ constant (e.g. 
APEXIT_FATAL_SIGNAL). Looking at httpd.h
the values of the existing constants seem a little bit strange:

/** a normal exit */
#define APEXIT_OK   0x0
/** A fatal error arising during the server's init sequence */
#define APEXIT_INIT 0x2
/**  The child died during its init sequence */
#define APEXIT_CHILDINIT0x3
/**
 *   The child exited due to a resource shortage.
 *   The parent should limit the rate of forking until
 *   the situation is resolved.
 */
#define APEXIT_CHILDSICK0x7
/**
 * A fatal error, resulting in the whole server aborting.
 * If a child exits with this error, the parent process
 * considers this a server-wide fatal error and aborts.
 */
#define APEXIT_CHILDFATAL   0xf


I haven't found any hint during my investigation why they are as they are on 
not just 1, 2, 3, 
Any ideas?


Regards

Rüdiger


Re: Faster start children after fatal signals?

2022-04-13 Thread Yann Ylavic
On Wed, Apr 13, 2022 at 4:30 PM Ruediger Pluem  wrote:
>
> While looking at PR65769 I stumbled across the below in event.c (same in 
> worker.c)
>
> 3460/* Don't perform idle maintenance when a child dies,
> 3461 * only do it when there's a timeout.  Remember only a
> 3462 * finite number of children can die, and it's pretty
> 3463 * pathological for a lot to die suddenly.
> 3464 */
> 3465continue;
>
> In case several processes or even all die by a segfault we would wait until 
> we have processed all that processes plus one second
> until we start new processes. Shouldn't we perform an 
> perform_idle_server_maintenance in case the process died because of a fatal
> signal?

+1