* Jann Horn:

>> Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_MM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?

No, there is still a problem with SIGTSTP handling because we cannot
atomically unmask the signal during execve.  We need to unblock
SIGTSTP before execve in the new process, but this means that it can
get suspended by SIGTSTP.  Consequently, the execve never happens and
the original process is stuck in vfork:

  posix_spawn: parent can get stuck in uninterruptible sleep if child
  receives SIGTSTP early enough
  <https://inbox.sourceware.org/libc-help/[email protected]/>

More on the low-level side, it's difficult to make sure that execve
gets a consistent snapshot of the environ vector.  Both vfork and
execve need to be async-signal-safe.  Any locking or memory allocation
(except for the stack …) persists in the original process after vfork
returns.  The environ vector can be large, so making a copy on the
stack is not ideal.  It's even harder for getenv/setenv/unsetenv
implementations that use locking instead of software transactional
memory.

In general, I prefer the vfork+execve API over things like posix_spawn
because eventually, you have dependencies between the syslets, or need
control flow.  This introduces a lot of complexity.  Conceptually,
vfork+execve is much simpler, and in many ways quite safe (even
mutexes work as long as they do not need a correct TID).

Thanks,
Florian
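
For illustration, the SIGTSTP window described above can be sketched
roughly as follows.  This is a minimal sketch under stated assumptions,
not how glibc implements posix_spawn: the helper name spawn_and_wait is
hypothetical, it assumes Linux behavior where the vfork child may call
sigprocmask (POSIX itself only permits _exit and the exec family
there), and it passes environ through unchanged without addressing the
snapshot problem.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run argv[0] via vfork+execve and wait for it.
   Returns the child's exit status, 127 if execve failed, -1 on error. */
int spawn_and_wait(char *const argv[])
{
    sigset_t all, saved;
    int status;

    /* Block signals so that no handler runs in the child while it
       still shares the parent's address space. */
    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;

    pid_t pid = vfork();
    if (pid == 0) {
        /* Child: the parent is suspended until execve or _exit. */
        /* The new program should start with the original mask, and
           execve cannot change the mask atomically, so unblock now. */
        sigprocmask(SIG_SETMASK, &saved, NULL);
        /* Window: a SIGTSTP (default disposition) arriving here stops
           the child before execve, and the parent stays in vfork. */
        execve(argv[0], argv, environ);
        _exit(127); /* execve failed */
    }

    /* Parent: runs again only after the child has exec'd or exited. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    if (pid < 0)
        return -1;

    /* Retry waitpid if a signal handler interrupts it. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_wait with an argv such as {"/bin/sh", "-c", "exit 3",
NULL} would be expected to return 3.  Keeping SIGTSTP blocked across
execve would close the window, but the new program would then inherit
the blocked signal, which is the trade-off described above.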

