On March 3, 2026 8:15 am, Hannes Laimer wrote:
> If the worker finishes right after we `waitpid` but before we add it to
> `WORKER_PIDS` the `worker_reaper` won't `waitpid` it cause it iterates
> over `WORKER_PIDS`. So

it would be interesting to get more details how this happens in practice
(with your reproducer)?

the sequence when forking a worker is:
- fork
- child executes some setup code
- child tells parent it is ready
- child waits for parent to tell it it can continue

register_worker is called by the parent in between the last two steps
(after receiving the notification form the child, but before sending the
notification to the child), so why does the child disappear inbetween?

I think this might actually (also?) be missing error handling in
fork_worker? all the POSIX::close/read/write calls there don't check for
failure, which means we attempt to register a worker that has already
failed at that point?

and, somewhat tangentially related - should we switch this code over to
use pidfds and waitid to close PID reuse races?

>  - the clean-up triggered by the SIGCHLD won't catch it cause it needs it to
>    be in `WORKER_PIDS`
>  - and, `register_worker` won't because it was still running when it
>    `waitpid`'ed it
> 
> Moving the insertion into `WORKER_PIDS` before the `waitpid` solves
> this by making sure it is
>  - always in the var for `worker_reaper`
>  - and, if SIGCHILD should trigger `worker_reaper` before we add it to
>    `WORKER_PIDS`, the `waitpid` in `register_worker` itself will catch
>    it
> 
> Signed-off-by: Hannes Laimer <[email protected]>
> ---
>  src/PVE/RESTEnvironment.pm | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
> index 4ed5c05..4677687 100644
> --- a/src/PVE/RESTEnvironment.pm
> +++ b/src/PVE/RESTEnvironment.pm
> @@ -99,17 +99,18 @@ my $register_worker = sub {
>  
>      return if !$pid;
>  
> -    # do not register if already finished
> +    $WORKER_PIDS->{$pid} = {
> +        user => $user,
> +        upid => $upid,
> +    };
> +
> +    # remove immediately if already finished
>      my $waitpid = waitpid($pid, WNOHANG);
>      if (defined($waitpid) && ($waitpid == $pid)) {
>          delete($WORKER_PIDS->{$pid});
>          return;
>      }
>  
> -    $WORKER_PIDS->{$pid} = {
> -        user => $user,
> -        upid => $upid,
> -    };
>  };
>  
>  # initialize environment - must be called once at program startup
> -- 
> 2.47.3
> 
> 
> 
> 
> 
> 



Reply via email to