[Bug 63169] MPM event, stuck process after graceful: no (old gen)

bugzilla Fri, 03 Sep 2021 12:03:29 -0700

https://bz.apache.org/bugzilla/show_bug.cgi?id=63169


--- Comment #7 from Joel Self <joels...@gmail.com> ---
What's happening here is that the mpm is spawning more processes than the
`active_daemons_limit` (`max_workers` divided by workers per process), but then
during a graceful restart it is only sending `active_daemons_limit` number of
kill characters (per bucket) over the pipe of death. You can see this in line
3114 of server/mpm/event/event.c (httpd 2.4.48):

```
        for (i = 0; i < num_buckets; i++) {
            ap_mpm_podx_killpg(all_buckets[i].pod, active_daemons_limit,
                               AP_MPM_PODX_GRACEFUL);
        }
```

The mpm spawns more child processes than `active_daemons_limit` because the
number of children created per maintenance pass grows exponentially when the
server is suddenly hit with a lot of traffic. In event.c,
`perform_idle_server_maintenance` examines each active process, counts its
active and idle threads, and marks which slots are free (have no process in
them) and can take a new child process. Then
`perform_idle_server_maintenance` checks if the idle thread count is less than
the minimum spare threads (line 2824):

```
else if (idle_thread_count < min_spare_threads / num_buckets) {
```

If the idle thread count is too low, it needs to create new child processes to
increase its idle thread count. At first it only creates a single child, but
before `perform_idle_server_maintenance` returns it doubles the spawn rate, so
the next time around `perform_idle_server_maintenance` will spawn two children
(line 2881):

```
            else if (retained->idle_spawn_rate[child_bucket]
                     < MAX_SPAWN_RATE / num_buckets) {
                retained->idle_spawn_rate[child_bucket] *= 2;
            }
```



If the next call to `perform_idle_server_maintenance` doesn't need to spawn
new children, the rate is reset to 1. However, if the next call to
`perform_idle_server_maintenance` does need to spawn children, it will spawn 2
and then increase the spawn rate to 4. This exponential growth continues until
the `active_thread_count` is greater than or equal to `max_workers`. However,
because of the exponential growth, the last batch of child processes spawned
may contain far more processes than are needed to reach `max_workers`, and
thus overshoot the `active_daemons_limit`.
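
As a rough illustration of the doubling described above, here is a simplified
model. It is not httpd code: `simulate_spawn_overshoot` and its parameters are
hypothetical names, it assumes every thread stays busy so each maintenance
pass has to spawn, and it ignores buckets and free-slot bookkeeping.

```c
#include <assert.h>

/* Hypothetical model, not httpd code. Each maintenance pass spawns `rate`
 * children and then doubles `rate` (capped at max_spawn_rate), for as long
 * as the thread count is still below max_workers. Returns the number of
 * daemons that exist once the spawning stops. */
int simulate_spawn_overshoot(int daemons, int threads_per_child,
                             int max_workers, int max_spawn_rate)
{
    int rate = 1;

    assert(threads_per_child > 0 && max_spawn_rate > 0);

    while (daemons * threads_per_child < max_workers) {
        daemons += rate;              /* spawn this pass's whole batch */
        if (rate < max_spawn_rate) {
            rate *= 2;                /* next pass spawns twice as many */
        }
    }
    return daemons;
}
```

Starting from an assumed 12 daemons with 25 threads each, a `max_workers` of
800 and a spawn rate cap of 32, the batches are 1, 2, 4, 8 and 16, and the
model ends at 43 daemons even though 800 / 25 = 32 would have been enough.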

Here's debug logging I added while investigating this issue:

```
idle_thread_count [0] < min_spare_threads [200], active_thread_count: 675,
max_workers: 800
Spawning 16 children, active_daemons: 27, total_daemons: 27, idle_thread_count:
0, min_spare_threads: 200, active_thread_count: 675, max_workers: 800

```

We're at 0 idle threads. The last time we spawned children we made 8, so this
time we'll make 16, but our active thread count is 675. At 25 workers per
process, spawning 16 processes will get us to `675 + 16 * 25 = 1075` workers,
which is well beyond our `max_workers` of 800. Another way to look at it is
that our `active_daemons_limit` is 32, and adding 16 new processes to the
current 27 gets us to 43, which is 11 more than our max of 32. Once we have 43
processes, a graceful restart will only send 32 kill characters over the pipe
of death (we only have 1 bucket), so 11 processes will never receive the kill
character and will live on, processing requests with the old configuration.
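
The arithmetic above can be checked with a small sketch. The helper names
`projected_workers` and `stranded_children` are hypothetical, and the sketch
assumes a single bucket and that each child consumes exactly one kill
character.

```c
#include <assert.h>

/* Worker threads that would exist after spawning `new_children` more
 * processes of `threads_per_child` threads each. */
int projected_workers(int active_threads, int new_children,
                      int threads_per_child)
{
    return active_threads + new_children * threads_per_child;
}

/* Children that never receive a kill character when only
 * `kill_chars_sent` characters are written to the pipe of death
 * (assumption: one bucket, one character consumed per child). */
int stranded_children(int total_daemons, int kill_chars_sent)
{
    assert(total_daemons >= 0 && kill_chars_sent >= 0);
    return total_daemons > kill_chars_sent
               ? total_daemons - kill_chars_sent
               : 0;
}
```

With the values from the log, `projected_workers(675, 16, 25)` gives 1075 and
`stranded_children(27 + 16, 800 / 25)` gives 11.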

We fixed this problem by limiting the number of newly spawned children plus the
`active_daemons` to be at most `active_daemons_limit`. Added at line 2856 of
event.c:

```
            if (free_length + active_daemons > active_daemons_limit) {
                free_length = active_daemons_limit - active_daemons;
            }
```
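
To see the effect of the clamp in isolation, here is a standalone sketch. The
function name `clamp_spawn_count` is hypothetical, and the floor at zero is an
extra guard added for this illustration; it is not part of the snippet above.

```c
#include <assert.h>

/* Hypothetical standalone version of the clamp: returns how many children
 * may be spawned so that active_daemons never exceeds active_daemons_limit.
 * The floor at zero is an assumption added here, for the case where the
 * limit has already been exceeded. */
int clamp_spawn_count(int free_length, int active_daemons,
                      int active_daemons_limit)
{
    assert(free_length >= 0);

    if (free_length + active_daemons > active_daemons_limit) {
        free_length = active_daemons_limit - active_daemons;
    }
    if (free_length < 0) {
        free_length = 0;   /* already over the limit: spawn nothing */
    }
    return free_length;
}
```

For the case in the log, `clamp_spawn_count(16, 27, 32)` returns 5, so the
server would stop at 32 processes instead of 43.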

I've attached to this bug a patch that we used to fix this.

Thanks,
Joel Self

