On Fri, Feb 06, 2026 at 02:16:13PM -0500, Waiman Long wrote:

> > In all cases when we get to that point, new_fs is always a freshly
> > created private copy of current->fs, not reachable from anywhere
> > other than stack frames of the callers, but the proof is not pretty.
> > copy_mnt_ns() is called only by create_new_namespaces() and it gets to
> > copying anything if and only if CLONE_NEWNS is in the flags.  So far,
> > so good.  The call in create_new_namespaces() is
> >     new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, 
> > new_fs);
> 
> Thanks for the detailed explanation. After further investigation as to while
> the pwd_refs is set, I found out the code path leading to this situation is
> the unshare syscall.
> 
> __x64_sys_unshare()
>  => ksys_unshare()
>   => unshare_fs(unshare_flags, &new_fs)
>   => unshare_nsproxy_namespaces(unshare_flags, &new_nsproxy,
>                                          new_cred, new_fs);
>    => create_new_namespaces(unshare_flags, current, user_ns,
>                                          new_fs ? new_fs : current->fs);
> 
> Here, CLONE_FS isn't set in unshare_flags. So new_fs is NULL and
> current->fs is passed down to create_new_namespaces(). That is why
> pwd_refs can be set in this case. So it looks like the comment in
> copy_mnt_ns() saying that the fs_struct is private is no longer true,
> at least in this case. So changing fs_struct without taking the lock
> can lead to unexpected result.

CLONE_FS is the red herring here (it will be set earlier in ksys_unshare()
if CLONE_NEWNS is there).  I really hate that how convoluted that is,
though.

Look: the case where we might get passed current->fs down there is real.
It can happen in one and only one situation - CLONE_NEWNS in unshare(2)
arguments *and* current->fs->users being 1.

It wouldn't suffice, since there's chroot_fs_refs() that doesn't give
a rat's arse for task->fs being ours - it goes and replaces every
->fs->pwd or ->fs->root that happens to point to old_root.

It's still not a real race, though - both chroot_fs_refs() and that area
in copy_mnt_ns() are serialized on namespace_sem.

And yes, it's obscenely byzantine.  It gets even worse when you consider
the fact that pivot_root(2) does not break only because the refcount
drops in chroot_fs_refs() are guaranteed not to reach 0 - the caller is
holding its own references to old_root.{mnt,dentry} and *thar* does not
get dropped until we drop namespace_sem.

IOW, that shit is actually safe, but man, has its correctness grown fucking
convoluted...

Grabbing fs->seq in copy_mnt_ns() wouldn't make the things better, though -
it seriously relies upon the same exclusion with chroot_fs_refs() for
correctness; unless you are willing to hold it over the entire walk through
the mount tree, the proof of correctness doesn't get any simpler.

Reply via email to