On March 3, 2026 9:37 am, Hannes Laimer wrote:
> On 2026-03-03 09:24, Fabian Grünbichler wrote:
>> On March 3, 2026 8:15 am, Hannes Laimer wrote:
>>> If the worker finishes right after we `waitpid` but before we add it to
>>> `WORKER_PIDS` the `worker_reaper` won't `waitpid` it cause it iterates
>>> over `WORKER_PIDS`. So
>>
>> it would be interesting to get more details how this happens in practice
>> (with your reproducer)?
>>
>
> I do have a reproducer for task processes sticking around as zombies
> when they are done, but this change unfortunately did not fix that. I
> just noticed this in the process of finding the cause for the "original"
> problem, so I guess this is not a problem in practice, cause of the
> tight timings? But technically it would be possible (I think)

so we figured that one out in the meantime.. and it is probably best to
fix that issue by revamping the worker tracking entirely, both to fix
the bug and to improve performance/reduce overhead.

I still think we want to improve the error handling during forking, and
that this patch here doesn't actually fix anything substantial other
than temporary zombies if the worker terminates during setup. it
shouldn't hurt either though..

>> the sequence when forking a worker is:
>> - fork
>> - child executes some setup code
>> - child tells parent it is ready
>> - child waits for parent to tell it it can continue
>>
>> register_worker is called by the parent in between the last two steps
>> (after receiving the notification from the child, but before sending the
>> notification to the child), so why does the child disappear in between?
>>
>> I think this might actually (also?) be missing error handling in
>> fork_worker? all the POSIX::close/read/write calls there don't check for
>> failure, which means we attempt to register a worker that has already
>> failed at that point?
>>
>
> could be, but I don't think that should influence if a `SIGCHLD` is sent
> when the child is done? Cause the handler for `SIGCHLD` in the parent is
> never called...
> I'll take a look at that, thanks for the pointer!
>
>> and, somewhat tangentially related - should we switch this code over to
>> use pidfds and waitid to close PID reuse races?
>>
>
> @Wolfgang also mentioned that, would probably make sense
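
to illustrate what I mean by checking the handshake during forking, here is
a minimal sketch of the fork/ready/continue sequence quoted above. it is
Python rather than the actual Perl code, and the `fork_worker` signature,
the `register` callback and the one-byte messages are made up for
illustration, not what our code does today. the point is only that the
parent treats EOF or an error on the "ready" pipe as a failed worker, reaps
it, and never registers it:

```python
import os


def fork_worker(child_setup, work, register):
    """Fork a worker with a two-pipe ready/continue handshake.

    Hypothetical helper: returns the child's PID after it was registered and
    told to continue. Raises RuntimeError (after reaping the child) if the
    child fails before the handshake completes.
    """
    ready_r, ready_w = os.pipe()  # child -> parent: "I am ready"
    go_r, go_w = os.pipe()        # parent -> child: "you may continue"

    pid = os.fork()
    if pid == 0:
        # Child: always leave via os._exit so we never return into the
        # parent's code path.
        status = 1
        try:
            os.close(ready_r)
            os.close(go_w)
            child_setup()
            os.write(ready_w, b"r")
            # An empty read means the parent closed the pipe without
            # telling us to continue.
            if os.read(go_r, 1) == b"g":
                work()
                status = 0
        finally:
            os._exit(status)

    # Parent: close the ends that belong to the child.
    os.close(ready_w)
    os.close(go_r)
    try:
        try:
            # b"" (EOF) here means the child exited during setup.
            ok = os.read(ready_r, 1) == b"r"
            if ok:
                register(pid)
                os.write(go_w, b"g")
        except OSError:
            ok = False
        if not ok:
            os.waitpid(pid, 0)  # reap now, nobody else is tracking this PID
            raise RuntimeError("worker %d failed during setup" % pid)
    finally:
        os.close(ready_r)
        os.close(go_w)
    return pid
```

with something like this, a worker that dies during setup is reaped right
where the failure is detected instead of relying on the reaper to find it
later. it does not address PID reuse, that would still need pidfds.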
