Hi all, I am happy to see this thread appear. I emailed Christian and others ~5 years ago about this in this thread[1]; it would be great to see it finally happen!
I very much agree that the new process spawning should be pidfd based. I also want to emphasize that the crux of the matter is that the code needed to set up the initial unscheduled process (which I do think should be "real state" and more than a mere template) is currently chopped up between clone and exec. So the real meat of the implementation would be factoring out a bunch of stuff so it can be reused in both the legacy clone+exec and modern code paths.

I'll say a bit more about this "real state" vs "mere template" distinction. The latter is effectively some sort of ad-hoc operation batching language, and always runs the risk of falling behind what the kernel actually supports. The "real state" approach, where we have honest-to-goodness process state that is just partially initialized and thus not yet scheduled, always supports everything the kernel supports in principle. Yes, alternative syscalls that specify which "embryonic" process to act on (as opposed to always acting on the current active process) need to be created, but that is less bad than trying to stuff things into flags etc. for a single existing system call. One can also imagine a world (as described in https://catern.com/rsys21.pdf) where an explicit "which process?" parameter starts getting added to new process-modifying machinery by *default*, with a sentinel value analogous to `AT_FDCWD` used to mean "the current process" for the legacy used-between-fork-and-exec use case.

Anyways, years ago, after taking a glance at the relevant code in Linux and FreeBSD, I figured that it would be easier for me personally to first implement this functionality in FreeBSD, and then, once I had a feel for some of the refactoring, take a stab at it in Linux. This is because Linux's feature set, especially things like `binfmt_misc`, makes its clone and exec quite a bit more complex, and thus the (IMO) necessary heavy refactoring quite a bit more extensive too.

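
To make the "which process?" parameter with an `AT_FDCWD`-like sentinel a bit more concrete, here is a minimal userspace toy. It is only a sketch: it is not kernel code and it is not the API from the patch in [3]. Every name in it (`toy_proc_new`, `toy_chdir_at`, `toy_proc_start`, `TOY_PROC_SELF`) is made up for illustration, and the "process state" is reduced to a working directory and a scheduled flag.

```c
#include <assert.h>
#include <string.h>

/* HYPOTHETICAL sentinel, by analogy with AT_FDCWD: "the calling process". */
#define TOY_PROC_SELF (-100)
#define TOY_MAX_PROCS 4

/* Toy stand-in for process state. A real kernel tracks far more than this. */
struct toy_proc {
    int in_use;
    int scheduled;   /* 0 while the process is still embryonic */
    char cwd[64];
};

static struct toy_proc toy_table[TOY_MAX_PROCS];
static struct toy_proc toy_self = { 1, 1, "/" };

/* Resolve the "which process?" parameter to a state object, or NULL. */
struct toy_proc *toy_lookup(int procfd)
{
    if (procfd == TOY_PROC_SELF)
        return &toy_self;
    if (procfd < 0 || procfd >= TOY_MAX_PROCS || !toy_table[procfd].in_use)
        return NULL;
    return &toy_table[procfd];
}

/* Create an embryonic process: real state, just not scheduled yet. */
int toy_proc_new(void)
{
    for (int i = 0; i < TOY_MAX_PROCS; i++) {
        if (!toy_table[i].in_use) {
            memset(&toy_table[i], 0, sizeof(toy_table[i]));
            toy_table[i].in_use = 1;
            strcpy(toy_table[i].cwd, "/");
            return i;
        }
    }
    return -1;
}

/* One operation serves both the legacy (self) and the new (embryo) path. */
int toy_chdir_at(int procfd, const char *path)
{
    struct toy_proc *p = toy_lookup(procfd);

    if (p == NULL)
        return -1;
    /* Another process may only be modified before it starts running. */
    if (p != &toy_self && p->scheduled)
        return -1;
    if (strlen(path) >= sizeof(p->cwd))
        return -1;
    strcpy(p->cwd, path);
    return 0;
}

/* Hand the set-up embryo to the scheduler (here: just flip a flag). */
int toy_proc_start(int procfd)
{
    struct toy_proc *p = toy_lookup(procfd);

    if (p == NULL || p->scheduled)
        return -1;
    p->scheduled = 1;
    return 0;
}

/* Walk through the intended workflow once. */
void toy_demo(void)
{
    int fd = toy_proc_new();                           /* embryo, not running */

    assert(fd >= 0);
    assert(toy_chdir_at(fd, "/srv") == 0);             /* configure the embryo */
    assert(toy_chdir_at(TOY_PROC_SELF, "/tmp") == 0);  /* legacy: act on self */
    assert(toy_proc_start(fd) == 0);                   /* now it may run */
    assert(toy_chdir_at(fd, "/etc") == -1);            /* too late to modify */
}
```

The point of the toy is only that `toy_chdir_at` has a single body for both callers: the sentinel selects the current process, any other handle selects an embryo, and nothing needs to be re-encoded into a template or a flags word.
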
I never got around to it in the 5 years, but these days, with LLMs, doing an "exploratory refactor" (to get a sketch of a patch that is fodder for discussion not yet fit for actual submission) is much easier. So inspired by this thread, I took a few hours to do the exploratory FreeBSD refactor in [2]. The man page for the new syscalls, [3], might be a good place to start reading. (This, being from a FreeBSD patch, describes the change in terms of "proc fds", but the switch to Linux's "pidfds" should be self-explanatory. The former after all inspired the latter.)

Hope discussion of such a patch isn't too off topic here, but there is an interesting thing to note that would also apply to a Linux implementation. It took *more* factored out helper functions than I thought. The current count is over 15(!); there didn't seem to be a way to build both the old and new way of doing things with fewer, coarser building blocks. Now, granted, maybe someone more familiar with either kernel than me could do a better job, but I think it will still be a number of functions. This indicates just how much untangling there is to do. And the number will surely be much higher for Linux.

[1]: https://lore.kernel.org/all/[email protected]/

[2]: https://github.com/obsidiansystems/freebsd-src/commit/better-proc-spawn 239dcdefe6ad244e58d998155b527375e5293ff7 for posterity

[3]: https://raw.githubusercontent.com/obsidiansystems/freebsd-src/refs/heads/better-proc-spawn/lib/libsys/proc_new.2

On Sun, May 31, 2026, at 10:47 PM, Li Chen wrote:
> Hi Christian,
>
> Thanks a lot for your great review!
>
> ---- On Thu, 28 May 2026 19:02:53 +0800 Christian Brauner
> <[email protected]> wrote ---
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already.
>
> Yes, that makes a lot more sense. I was staring too hard at the "hot
> executable" part and made the cache/template the API, which was probably
> the wrong thing to expose. Sorry about that.
>
> > Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
>
> That's so cool, and this is a really useful point. I had not thought about
> this as something that could sit under posix_spawn(), but that makes the
> target much clearer. It should be a generic exec/spawn builder first, and
> the agent use case should just be one user of it.
>
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
>
> Reusing pidfd_open() with an empty target is nice because it keeps the API
> close to pidfds, but I wonder if a separate entry point such as
> pidfd_spawn_open() or pidfd_create() would make the "new process
> builder" case a bit more explicit? Either way, the configuration side
> being fsconfig-like makes sense to me.

Yeah check out my syscalls [3] on that front. It's important to design the workflow / state machine in a good way. Performance/efficiency, security (share less state/privileges by default!), and extensibility (where will newer concepts, like a new type of namespace, fit in?) are all competing concerns, but I think they mostly pull in the same direction. (Only no ambient authority, back compat, and extensibility exist in some tension.)

> Thanks again for pointing me in this direction. It helps a lot.
>
> Regards,
> Li

Glad you are sold on pidfds, and more broadly, best of luck! You'll be a hero to everyone else that has wanted this over the years :)

John

