On Wed, Feb 25, 2026 at 5:33 AM Pavel Tikhomirov <[email protected]> wrote:
>
> This effectively gives us an ability to create the pid namespace init as
> a child of a process (setns-ed to the pid namespace) different from the
> process which created the pid namespace itself.
>
> Original problem:
>
> There is a cool set_tid feature in the clone3() syscall, it allows you to
> create a process with desired pids on multiple pid namespace levels, which
> is useful to restore processes in CRIU for the nested pid namespace case.
>
> In the nested container case we can potentially see this kind of pid/user
> namespace tree:
>
>                          Process
>                        ┌─────────┐
> User NS0 ──▶ Pid NS0 ──▶ Pid p0  │
>     │           │      │         │
>     ▼           ▼      │         │
> User NS1 ──▶ Pid NS1 ──▶ Pid p1  │
>     │           │      │         │
>    ...         ...     │   ...   │
>     │           │      │         │
>     ▼           ▼      │         │
> User NSn ──▶ Pid NSn ──▶ Pid pn  │
>                        └─────────┘
>
> So to create the "Process" and set pids {p0, p1, ... pn} for it on all
> pid namespace levels we can use the clone3() syscall set_tid feature, BUT
> the syscall does not allow you to set the pid on pid namespace levels
> where you don't have permission. So basically you have to be in "User
> NS0" when creating the "Process" to actually be able to set pids on all
> levels.
>
> It is ok for almost any process, but with pid namespace init this does
> not work, as currently we can only create pid namespace init and the pid
> namespace itself simultaneously, so to make "Pid NSn" owned by "User
> NSn" we have to be in the "User NSn".
>
> We can't possibly be in "User NS0" and "User NSn" at the same time,
> hence the problem.
>
> Alternative solution:
>
> Yes, for the case of pid namespace init we can use the good old
> /proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
> it is much more complicated and requires tons of extra code. It
> would be nice to make the clone3() set_tid interface also applicable to
> this corner case.
>
> Implementation:
>
> Now that anyone can setns to the pid namespace before the creation of
> init, and thus multiple processes can fork children into the pid
> namespace, it is important that we enforce that the first process created
> is always the pid namespace init. (Note that this was done by the previous
> preparatory patch as a standalone useful change.) We only allow other
> processes after the init sets pid_namespace->child_reaper.
>
> Reviewed-by: Oleg Nesterov <[email protected]>
> Signed-off-by: Pavel Tikhomirov <[email protected]>
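For readers less familiar with set_tid, here is a minimal sketch of the clone3() usage the description refers to. It is not code from this patch. The helper names fill_set_tid_args() and clone3_set_tid() and the SKETCH_MAX_PID_NS_LEVEL constant are hypothetical, chosen only for illustration; the constant is assumed to mirror the kernel's nesting limit of 32. Note the array order: set_tid[0] is the pid in the most deeply nested pid namespace (pn in the diagram), followed by the pids in each ancestor namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/sched.h>   /* struct clone_args */
#include <signal.h>        /* SIGCHLD */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>   /* SYS_clone3 */
#include <sys/types.h>
#include <unistd.h>

/* Assumed to mirror the kernel's limit on pid namespace nesting. */
#define SKETCH_MAX_PID_NS_LEVEL 32

/*
 * Hypothetical helper: fill clone3() arguments so that the child gets
 * tids[0] in the innermost pid namespace, tids[1] in its parent, and so
 * on up to tids[n - 1]. The tids array must stay valid until the
 * syscall is made, because only its address is stored here.
 */
int fill_set_tid_args(struct clone_args *args, const pid_t *tids, size_t n)
{
	assert(args != NULL);

	if (tids == NULL || n == 0 || n > SKETCH_MAX_PID_NS_LEVEL) {
		errno = EINVAL;
		return -1;
	}

	memset(args, 0, sizeof(*args));
	args->exit_signal = SIGCHLD;                 /* behave like fork() */
	args->set_tid = (uint64_t)(uintptr_t)tids;   /* pointer passed as u64 */
	args->set_tid_size = n;                      /* number of levels to set */
	return 0;
}

/*
 * Hypothetical wrapper: returns the child's pid in the parent, 0 in the
 * child, and -1 with errno set on failure (for example EPERM when the
 * caller lacks permission on one of the requested levels).
 */
pid_t clone3_set_tid(const pid_t *tids, size_t n)
{
	struct clone_args args;

	if (fill_set_tid_args(&args, tids, n) < 0)
		return -1;

	return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}
```

A caller would pass something like `pid_t tids[] = { pn, ..., p1, p0 };` and needs CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace owning each pid namespace it sets a pid in, which is exactly the permission constraint the description runs into for the namespace init.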
Acked-by: Andrei Vagin <[email protected]>

Thanks,
Andrei

