On Wed, Jun 10, 2026 at 08:29:06PM +0800, Li Chen wrote:
>  ---- On Wed, 10 Jun 2026 01:27:47 +0800  John Ericson <[email protected]> wrote ---
>  > Hope the above answers your question? I suppose my ideas lean more on the
>  > "future" than "empty" side --- there is indeed a thread in the thread
>  > group, with real VM/namespace/file descriptor etc. state. Moreover, state
>  > gets initialized before the process is started, so the actual start is a
>  > pretty lightweight step of just letting the scheduler know the now-ready
>  > process can be scheduled. The only thing that distinguishes the embryonic
>  > process from a real one is simply that it isn't running --- i.e. isn't
>  > (yet) available to be scheduled --- so the pidfds holders are free to poke
>  > at its state.
>  > 
> 
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.
> 
> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
> 
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);
> 
> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run. Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.
> 
> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.
> 

As I tried to explain in my previous e-mail, this approach does not cut
it because of NUMA.

Suppose you have a machine with 2 nodes. The parent-to-be is running
on node 0 and the child is intended to exec something on node 1.

When the parent-to-be allocates and populates state, the memory backing
it comes from node 0. If you allocate task_struct, the file table and
other frequently used (and modified!) objects this way, you are
guaranteeing a performance loss due to the interconnect traffic needed
to access them.

Trying to add plumbing so that all allocations respect NUMA placement is
probably too cumbersome.

The primary example of that is looking up the binary to exec in the
first place.

Userspace likes to pass paths which don't exist, meaning that checking
for the binary before doing any hard work is a useful optimization.
Suppose the binary to be executed is in a container bound with a taskset
to node 1, and the content of the fs part of the container is currently
fully uncached.

When you perform the lookup on node 0, you are populating a bunch of
metadata (inode, dentry) using memory from that domain. But the intended
user will only execute on node 1, again resulting in a performance loss.

In order to avoid that, you would need to convince VFS to allocate
memory elsewhere.

So I stand by my previous claim that ultimately a pristine child has to
be created (like in this patch), and that this child also has to do the
work on its own.

Suppose there is no explicit placement requested anywhere. Even in that
case there are legitimate workloads which will eventually be forced to
exec stuff on another node. Even these have a better chance of retaining
full locality if the child process does all the work.

Per my previous message, I don't see a clean interface to do it.
Something quasi-posix_spawn is probably the least bad way out; it will
also allow userspace to easily wrap the new thing with posix_spawn
itself.
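
For reference, this is the shape of the existing posix_spawn interface
that a quasi-posix_spawn call would resemble. It is only a sketch of
today's API, not of any proposed syscall, and run_via_posix_spawn() is a
name made up for illustration. The caller describes what should happen
(here: stdin redirected to /dev/null) and passes that description along
with the spawn request, instead of building up child state itself.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: spawn `prog` (looked up via PATH) with no extra
 * arguments and wait for it.
 * Returns the child's exit status, or -1 if spawning or waiting failed.
 */
int run_via_posix_spawn(const char *prog)
{
	char *const argv[] = { (char *)prog, NULL };
	posix_spawn_file_actions_t fa;
	posix_spawnattr_t attr;
	int status = 0;
	int ret = -1;
	int err;
	pid_t pid;

	if (posix_spawn_file_actions_init(&fa) != 0)
		return -1;
	if (posix_spawnattr_init(&attr) != 0) {
		posix_spawn_file_actions_destroy(&fa);
		return -1;
	}

	/* Describe the child's setup up front; nothing runs yet. */
	err = posix_spawn_file_actions_addopen(&fa, 0, "/dev/null",
					       O_RDONLY, 0);
	if (err == 0)
		err = posix_spawnp(&pid, prog, &fa, &attr, argv, environ);

	if (err == 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		ret = WEXITSTATUS(status);

	posix_spawnattr_destroy(&attr);
	posix_spawn_file_actions_destroy(&fa);
	return ret;
}
```

For example, run_via_posix_spawn("true") returns the exit status of the
`true` utility found in PATH.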

Also note there is another issue with the fd-based approach: the fd will
get inherited on fork and will hang around in the child afterwards
unless explicitly closed. Suppose you have a multithreaded program which
likes to both fork (without exec) and fork+exec. With the fd-based
approach you have no means of stopping a fork done by another thread
from grabbing your state, thanks to Unix defaulting to copying
everything. There was an attempt to fix this aspect with O_CLOFORK, but
this got rejected.

Whatever exactly happens, NUMA is a sad fact of computing and needs to
be accounted for. The approach as proposed not only fails to do so, but
actively hinders such deployments.
