On Mon, 04 Dec 2023, Donald Buczek wrote:
> 
> On 11/30/23 21:37, NeilBrown wrote:
> > On Thu, 30 Nov 2023, Kent Overstreet wrote:
> >> On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
> >>> On 11/29/23 22:43, Kent Overstreet wrote:
> >>>> On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm happy, bcachefs finally made it to mainline, Kudos to Kent!
> >>>>>
> >>>>> As far as I know, there is an ongoing discussion about the problems of 
> >>>>> non-unique inode numbers exposed by snapshots an no real conclusion 
> >>>>> yet, on how filesystems should expose snapshots.
> >>>>>
> >>>>> The current behavior is kind of a showstopper for us, because we are 
> >>>>> still running multi-user systems. The problem is, that any unprivileged 
> >>>>> user can create subvolumes and snapshots:
> >>>>>
> >>>>>       buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> >>>>>       buczek@dose:/scratch/local3$ mkdir vol1/dir1
> >>>>>       buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> >>>>>       buczek@dose:/scratch/local3$ ls -li vol1/
> >>>>>       total 0
> >>>>>       1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >>>>>       1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> >>>>>       buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> >>>>>       total 0
> >>>>>       1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >>>>>       buczek@dose:/scratch/local3$ find .
> >>>>>       .
> >>>>>       ./vol1
> >>>>>       find: File system loop detected; ‘./vol1/snp1’ is part of the 
> >>>>> same file system loop as ‘./vol1’.
> >>>>>       ./vol1/dir1
> >>>>>       buczek@dose:/scratch/local3$
> >>>>>
> >>>>> We have a few tools which walk over the filesystem (backup, mirror, 
> >>>>> accounting) and these are just not prepared for non-unique inode 
> >>>>> numbers aside from regular hardlinks. I'm concerned that experiments of 
> >>>>> a unprivileged users could make important tools to fail.
> >>>>>
> >>>>> Questions:
> >>>>>
> >>>>> - Would it be a workaround to make creation of subvolumes and snapshots 
> >>>>> privileged operations?
> >>>>>
> >>>>> - If we want to evolve our tools: What is the best way for userspace to 
> >>>>> recognize subvolumes and snapshots and tell them apart from ordinary 
> >>>>> directories?
> >>>>
> >>>> I'm considering writing an "integer identifiers considered harmful"
> >>>> paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> >>>> adding in the overlayfs issue makes this problem sound fundamentally
> >>>> unsolvable given the current constraints.
> >>>>
> >>>> Recognizing subvolume boundaries: we don't have anything for this yet,
> >>>> and given what happened with btrfs I'm not sure we want to go that
> >>>> route.
> >>>
> >>> A fabricated fsid - I think, that is what btrfs does - wouldn't help
> >>> us either. The mentioned tools would just stop at the assumed mount
> >>> points.
> >>
> >> *nod* - nor for NFS.
> >>
> >>>> We could (probably) expose unique inode numbers, filesystem wide, but at
> >>>> a cost. We'd have to restrict ourselves internally to 32 bit inode
> >>>> numbers, and then report inode numbers that include the subvolume ID as
> >>>> the high 32 bits.
> >>>>
> >>>> That won't leave enough bits to shard inode numbers at create time based
> >>>> on CPU id, which isn't the _most_ important optimization, but would hurt
> >>>> to lose.
> >>>>
> >>>> Going forward, we really need - as you allude to - better userspace
> >>>> APIs. Inode number probably can't be an integer anymore, it needs to be
> >>>> a string if it's going to be able to do all the things we want it to do.
> >>>
> >>> Should be an uuid then. Expose inode uuid (and/or snapshot uuid) via
> >>> xattr? Anyway, I know, people more competent than me are thinking
> >>> about the problem.
> >>
> >> Exposing the subvolume ID via an xattr would be easy, if that works for
> >> you.
> > 
> > I think that would be a bad choice.  Stick with the filehandle.  It is
> > an existing interface supported by many filesystems and it works over
> > NFS.
> 
> 
> Hmm, isn't a (struct file_handle) filehandle supposed to be opaque?
> I'm not sure if it is a good idea to stick filesystem-specific
> meta-information (like a subvolume ident) into special bits of opaque
> numbers (whether inode number or filehandle) at all. 

Yes, a filehandle is opaque.  Digging around inside it would be wrong.

My understanding was that your primary need was not to know where
subvolumes were, but to know when two files were different even though
they had the same inode number.  filehandles can be used to tell if two
files are different with total reliability.

> 
> The information could also be made available via specialized ioctl(s)
> as an alternative to xattr.

ioctl would be worse than xattr.  statx extension would be best if you
need something that filehandle cannot provide.

> 
> One disadvantage of embedded in filehandle, xattr or ioctl is that
> userspace would need an extra syscall for each directory just to see
> if this is a subvolume or a snapshot, which most of the time it is
> not.  Maybe, even an otherwise unnecessary open is needed to get a
> consistent view between multiple information-retrieving syscalls.  If
> only statx() was extendable.

uhmm... statx is extendable...
Adding a STATX_ATTR_SUBVOLUME_ROOT attribute flag shouldn't be too
controversial, if a clear need and use-case were presented.

> 
> You don't have this problem with a crafted fsid.  Then creating a
> subvolume or a snapshot could be seen as an implied mount.  Maybe
> that's not the worst solution. 

No, probably not the *worst*, but not far from it.  It is a really bad
non-solution.

Have a look at
 
https://github.com/SUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev

This is a monstrosity that SUSE needs to carry so that the fake fsid
that btrfs generates for subvolumes is used consistently.  There are
various interfaces in the kernel that use fsid/inode - some just for
error reporting, others for more substantial things.  In mainline linux
these do the wrong thing for btrfs subvolumes.

Please don't copy this mistake into bcachefs.

Thanks,
NeilBrown

> 
> Best
> 
>    Donald
> 
> > 
> > NeilBrown
> > 
> > 
> >>
> >>> What to you think about the suggestion to make creation of snapshots
> >>> (and possible volumes, not sure about that) a privileged operation
> >>> (optional, of course) ? I know, multi-user is niche nowadays, but for
> >>> me it looks like a rather cheap workaround.
> >>
> >> Yeah, that's not an unreasonable ask.
> >>
> > 
> 
> -- 
> Donald Buczek
> [email protected]
> Tel: +49 30 8413 1433
> 


Reply via email to