Thanks a lot @Fabian and @Fiona for helping me debug this!

The problem is that some libraries temporarily overwrite the SIGCHLD
handler. If such a library is called often enough, a child can exit
while the handler is replaced, and that CHLD signal is lost. As a
result, `worker_reaper` in RESTEnvironment is not called, so tasks
won't get cleaned up until a different SIGCHLD arrives at the same
`pvedaemon` process and triggers `worker_reaper`.
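
To illustrate the failure mode, here is a minimal sketch in Python (not
the Perl code from the patches; it assumes Linux and that a library
resets SIGCHLD to its default disposition for the duration of a call).
`reap` is only a stand-in for `worker_reaper`, and the "library call"
is simulated by swapping the handler by hand:

```python
import os
import signal

seen = []  # PIDs collected by the reaper


def reap(signum=None, frame=None):
    """Stand-in for worker_reaper: collect exited children without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        seen.append(pid)


signal.signal(signal.SIGCHLD, reap)

# Simulated library call: our handler is swapped out for a moment.
saved = signal.signal(signal.SIGCHLD, signal.SIG_DFL)
child = os.fork()
if child == 0:
    os._exit(0)  # a worker finishes while the handler is swapped out
# Block until the child has exited, but leave it unreaped (WNOWAIT).
os.waitid(os.P_PID, child, os.WEXITED | os.WNOWAIT)
signal.signal(signal.SIGCHLD, saved)  # the library restores our handler

lost = child not in seen  # the handler never ran for this child
reap()                    # a later reaper run still finds the zombie
recovered = child in seen
```

The signal is discarded rather than queued while the default
disposition is active, so restoring the handler afterwards does not
replay it. The exited child stays around as a zombie until something
calls the reaper again.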

As @Fabian mentioned in [1], a general rework of the task handling,
potentially with `pidfd`s, would make a lot of sense.

These two patches address the problem within the task handling
structure as it currently is. They
 - run the PAM lib call in a fork, so signal handler changes made by the
   library are isolated from our process
 - run `worker_reaper` periodically (every 5s) to catch any other
   instances of this, since the same could happen with other libs, not
   just PAM
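
Both measures can be sketched roughly as follows. This is a Python
illustration only, not the Perl code in src/PVE/Auth/PAM.pm or
src/PVE/RESTEnvironment.pm. The names `run_isolated`,
`clobbering_check`, `reap_periodically` and the `workers` set are made
up for the example, and passing the result back via the exit status is
an assumption, not necessarily how the patch does it:

```python
import os
import signal
import time


def on_sigchld(signum, frame):
    pass  # placeholder for the daemon's real SIGCHLD handler


def clobbering_check(password):
    """Hypothetical library call that leaves SIGCHLD at its default."""
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return password == "secret"


def run_isolated(func, *args):
    """Run func in a forked child; report its boolean result via exit status."""
    pid = os.fork()
    if pid == 0:
        try:
            ok = bool(func(*args))
        except BaseException:
            ok = False
        os._exit(0 if ok else 1)  # never return into the parent's code path
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


workers = set()  # PIDs of forked workers that still need reaping


def worker_reaper():
    """Non-blocking pass over all known workers; returns PIDs that finished."""
    finished = []
    for pid in list(workers):
        try:
            done, _ = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            done = pid  # already reaped elsewhere
        if done == pid:
            workers.discard(pid)
            finished.append(pid)
    return finished


def reap_periodically(interval, rounds):
    """Fallback: run worker_reaper every `interval` seconds (bounded for the demo)."""
    for _ in range(rounds):
        time.sleep(interval)
        worker_reaper()


signal.signal(signal.SIGCHLD, on_sigchld)
auth_ok = run_isolated(clobbering_check, "secret")
handler_intact = signal.getsignal(signal.SIGCHLD) is on_sigchld

pid = os.fork()
if pid == 0:
    os._exit(0)  # worker exits right away
workers.add(pid)
reap_periodically(0.05, 20)  # the series uses 5s; shortened here
all_reaped = not workers
```

The handler change stays confined to the short-lived child, so the
parent's SIGCHLD handler is untouched. The periodic pass does not rely
on any signal being delivered, which is why it also covers libraries
other than PAM.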

[1] https://lore.proxmox.com/pve-devel/[email protected]/T/#m7b0f3873be5755f330e288cfa50905744f225b2b


pve-common:

Hannes Laimer (1):
  RESTEnvironment: periodically reap workers as SIGCHLD fallback

 src/PVE/RESTEnvironment.pm | 9 +++++++++
 1 file changed, 9 insertions(+)


pve-access-control:

Hannes Laimer (1):
  pam: fork for PAM authentication to isolate SIGCHLD handler

 src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 32 deletions(-)


Summary over all repositories:
  2 files changed, 51 insertions(+), 32 deletions(-)

-- 
Generated by murpp 0.9.0


