On Mon, Jun 8, 2026 at 5:02 PM Jann Horn <[email protected]> wrote:
>
> On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <[email protected]> wrote:
> > This problem is dear to my heart and I have been pondering it on and off
> > for some time now. The entire fork + exec idiom is terrible and needs to
> > be retired.
>
> It seems to me like vfork+exec is a decent UAPI building block, on
> which you can build nice-looking userspace APIs, though I agree that
> this is not an ideal direct interface for application code.
>
> > Additionally there is a known problem where transiently copied file
> > descriptors on fork + exec cause a headache in multithreaded programs
> > doing something like this in parallel. I only did cursory reading, it
> > seems your patchset keeps the same problem in place.
>
> I think we almost have UAPI that would let you avoid this issue?
> You can use clone() with CLONE_FILES, then unshare the FD table with
> close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
> implemented to be atomic with stuff that happens on other threads, but
> if we changed that, and it doesn't provide a good way to carry some
> FDs across, but it feels to me like this could be fixed with a variant
> of close_range() that removes O_CLOEXEC FDs except ones listed in an
> array.

Suppose you want to exec a binary with the following fd set:
0 is /dev/null
1 is fd 1023 in your process
2 is fd 1023 in your process

You have tons of other fds and you don't want any of them anywhere near
this.

A clean interface from my standpoint would avoid any unnecessary overhead
and would allow you to clearly specify what you want. In this case,
whatever the interface, it should provide the ability to map 1023 to 1
and 2 in the child.

With the current syscall set you get refs taken on these on clone, then
you have to manually dup2 them, which means separate syscalls with extra
atomics on top. A fast & elegant solution would allow you to tell the
kernel directly where to install the 2 files.

Also note that in practical terms userspace likes to closefrom/close_range
anyway to get rid of unwanted fds which happen to not have the cloexec
bit, which is yet another syscall to invoke on the way to exec. A better
interface would avoid the problem outright by not copying the unwanted
fds unless asked.

For it to be viable as a foundation to build posix_spawn on, such copying
would have to be supported of course.

> > There are numerous impactful ways to speed up execs both in terms of
> > single-threaded cost and their multicore scalability, most of which
> > would be immediately usable by all programs without an opt-in. imo these
> > needs to be exhausted before something like a "template" can be
> > considered.
>
> (I think probably a large part of this would be stuff that happens in
> userspace, like dynamic linking.)

I have not investigated userspace, but even putting specific APIs aside,
the kernel has *a lot* of avoidable overhead.

> > Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_MM?
> The C function vfork() is kind of a terrible API because of its
> returns-twice behavior, but I think if process cloning with
> CLONE_VM|CLONE_VFORK was wrapped by libc in a way similar to clone()
> (with the child executing a separate handler function), or if it was
> used in the implementation of some higher-level process-spawning API,
> it would be a perfectly fine API?
>
> Or am I misunderstanding what you mean by "messing with mm"?
>

I was not aware of this functionality, let's assume it indeed works. You
still have the file issue described above.
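
To make the fd example concrete, here is a rough sketch of what the
sequence looks like with today's syscalls. spawn_with_stdio() is a
hypothetical helper name made up for illustration, and it uses plain
fork() rather than clone()/vfork() to keep the example short. Treat it as
an illustration of the overhead, not as a proposal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with fd 0 = /dev/null and
 * fds 1 and 2 both referring to src_fd (e.g. fd 1023 in the parent).
 * Returns the child pid in the parent, or -1 if fork() failed.
 */
static pid_t spawn_with_stdio(int src_fd, char *const argv[])
{
	/* Assumption: src_fd is not one of the stdio slots we overwrite. */
	assert(src_fd > 2);

	/* fork() copies the whole fd table, taking a ref on every file. */
	pid_t pid = fork();
	if (pid != 0)
		return pid;

	/* Child: install the three files one syscall at a time. */
	int nullfd = open("/dev/null", O_RDONLY);
	if (nullfd < 0)
		_exit(127);
	if (dup2(nullfd, 0) < 0 || dup2(src_fd, 1) < 0 || dup2(src_fd, 2) < 0)
		_exit(127);

	/* Yet another syscall: drop every other fd, cloexec or not. */
	if (syscall(SYS_close_range, 3U, ~0U, 0U) < 0)
		_exit(127);

	execvp(argv[0], argv);
	_exit(127); /* only reached if exec failed */
}
```

That is open + 3x dup2 + close_range in the child, on top of the fd table
copy done at fork time, all of it before exec even starts.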

