On March 4, 2026 2:46 pm, Hannes Laimer wrote:
> Libraries may temporarily override $SIG{CHLD}, causing worker exit
> signals to be lost. Poll worker_reaper every 5 seconds via an AnyEvent
> timer to catch any missed signals in API server contexts where this can
> be problematic.
>
> Signed-off-by: Hannes Laimer <[email protected]>
> ---
> src/PVE/RESTEnvironment.pm | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
> index 81d7e29..cb44823 100644
> --- a/src/PVE/RESTEnvironment.pm
> +++ b/src/PVE/RESTEnvironment.pm
> @@ -37,6 +37,7 @@ my $rest_env;
>
> my $WORKER_PIDS;
> my $WORKER_FLAG = 0;
> +my $worker_reaper_timer;
>
> my $log_task_result = sub {
> my ($upid, $user, $status) = @_;
> @@ -124,6 +125,14 @@ sub init {
> }
> };
>
> + # Periodically reap workers as a fallback in case a library call
> temporarily overrides
> + # $SIG{CHLD} and causes us to miss a signal. Only useful when an event
> loop is running.
I'd replace the "Only useful" with something like "Only runs", makes it more
explicit.
AFAICT, this not only affects SIGCHLD being redirected, but also SIGCHLD
triggering while we are processing a previous SIGCHLD (if we checked
that particular PID already while iterating, but before it exited).
we could maybe consider to change worker_reaper to do
my $dead;
do {
$dead = waitpid(-1, WNOHANG))
if (my $info = $WORKER_PIDS->{$dead}) {
# log and delete
}
} while $dead > 0;
that way, it should be faster/cheaper and handle other (transitive,
non-worker) zombies that might have been missed by whatever forked them?
we do have quite a few places where we explicitly wait for a specific
child though, that need to be checked because they might break as a
result..
so in the end the best way to make this better is maybe to use pidfds
via AnyEvent if we have an event loop, to get notified and handle
workers exiting without relying on SIGCHLD at all..
for CLI usage, we already waitpid the forked worker directly anyway (so
should not have a zombie problem, and that flow could switch to a pidfd
as well).. there's two fork_worker calls that set the background flag,
but I don't even see how they'd be invoked in a CLI environment..
> + $worker_reaper_timer = AnyEvent->timer(
> + after => 5,
> + interval => 5,
> + cb => sub { $worker_reaper->() },
worker_reaper isn't that cheap atm, I don't think we want to run it
every 5s.. on a busy server with lots of tasks, this will do a syscall
per registered task worker..
since this is "just" the fallback for the rare case of "somehow we
missed our signal", doing it once every minute or so should be enough?
I think we can apply this with the timer interval set to a more sensible
value, and the comments slightly adapted..
> + );
> +
> # environment types
> # cli ... command started fron command line
> # pub ... access from public server (pveproxy)
> --
> 2.47.3
>
>
>
>
>
>