I also encountered hanging tasks while running e2e tests, which often caused tests to run into timeouts even though the task was already "OK". After applying these patches to the test VMs, I no longer saw any hanging tasks, which significantly sped up the test runs.
Consider this:

Tested-by: Michael Köppl <[email protected]>

On Wed Mar 4, 2026 at 2:46 PM CET, Hannes Laimer wrote:
> Thanks a lot @Fabian and @Fiona for helping me debug this!
>
> The problem is that some libraries temporarily overwrite the SIGCHLD
> handler. If the library is called fast enough, this can lead to lost
> CHLD signals, which in turn prevents `worker_reaper` from being called in
> RESTEnvironment. So tasks won't get cleaned up until a different SIGCHLD
> arrives at the same `pvedaemon` process, triggering `worker_reaper`.
>
> As @Fabian mentioned in [1], a general re-work of the task handling,
> potentially with `pidfd`s, would make a lot of sense.
>
> These two patches address the problem in the task handling structure as
> it currently is. They
> - run the PAM lib call in a fork, so signal handler changes the library
>   makes are isolated from our process
> - run `worker_reaper` periodically (5s) to catch any other potential
>   instances of this, since it would be possible that the same happens
>   with other libs, not just PAM
>
> [1]
> https://lore.proxmox.com/pve-devel/[email protected]/T/#m7b0f3873be5755f330e288cfa50905744f225b2b
>
>
> pve-common:
>
> Hannes Laimer (1):
>   RESTEnvironment: periodically reap workers as SIGCHLD fallback
>
>  src/PVE/RESTEnvironment.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
>
> pve-access-control:
>
> Hannes Laimer (1):
>   pam: fork for PAM authentication to isolate SIGCHLD handler
>
>  src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 32 deletions(-)
>
>
> Summary over all repositories:
>   2 files changed, 51 insertions(+), 32 deletions(-)
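
For anyone following along who has not looked at the patches: the two mitigations described in the quoted cover letter can be illustrated with a rough Python sketch. This is not the Perl code from the series. The names `run_isolated` and `reap_workers` are hypothetical, and passing the result back through the child's exit status is an assumption made only to keep the example short.

```python
import os
import signal


def run_isolated(func):
    """Run func() in a forked child (hypothetical helper).

    Any signal handler the callee installs lives and dies with the child,
    so the parent's SIGCHLD handler is never touched.
    Returns True if func() returned a truthy value, False otherwise.
    """
    pid = os.fork()
    if pid == 0:
        # Child: report the outcome through the exit status only.
        status = 1
        try:
            status = 0 if func() else 1
        finally:
            os._exit(status)
    # Parent: wait for exactly this child and decode its exit status.
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


def reap_workers(workers):
    """Non-blocking reap of finished children (hypothetical helper).

    Meant to be called from a periodic timer as a fallback for lost
    SIGCHLD signals. `workers` is a set of child PIDs, modified in place.
    """
    for pid in list(workers):
        try:
            done, _ = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            done = pid  # child was already reaped elsewhere
        if done == pid:
            workers.discard(pid)
    return workers
```

Calling `reap_workers` from a timer is cheap because `os.waitpid` with `os.WNOHANG` returns immediately for children that are still running, so the periodic call only costs one syscall per tracked worker.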
