On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
> On 11/29/23 22:43, Kent Overstreet wrote:
> > On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> > > Hi,
> > > 
> > > I'm happy bcachefs finally made it to mainline. Kudos to Kent!
> > > 
> > > As far as I know, there is an ongoing discussion about the problems of
> > > non-unique inode numbers exposed by snapshots, and no real conclusion yet
> > > on how filesystems should expose snapshots.
> > > 
> > > The current behavior is kind of a showstopper for us, because we are
> > > still running multi-user systems. The problem is that any unprivileged
> > > user can create subvolumes and snapshots:
> > > 
> > >      buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> > >      buczek@dose:/scratch/local3$ mkdir vol1/dir1
> > >      buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> > >      buczek@dose:/scratch/local3$ ls -li vol1/
> > >      total 0
> > >      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > >      1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> > >      buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> > >      total 0
> > >      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > >      buczek@dose:/scratch/local3$ find .
> > >      .
> > >      ./vol1
> > >      find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
> > >      ./vol1/dir1
> > >      buczek@dose:/scratch/local3$
> > > 
> > > We have a few tools which walk over the filesystem (backup, mirror, 
> > > accounting) and these are just not prepared for non-unique inode numbers 
> > > aside from regular hardlinks. I'm concerned that experiments by
> > > unprivileged users could make important tools fail.
> > > 
> > > Questions:
> > > 
> > > - Would it be a workaround to make creation of subvolumes and snapshots 
> > > privileged operations?
> > > 
> > > - If we want to evolve our tools: What is the best way for userspace to 
> > > recognize subvolumes and snapshots and tell them apart from ordinary 
> > > directories?
> > 
> > I'm considering writing an "integer identifiers considered harmful"
> > paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> > adding in the overlayfs issue makes this problem sound fundamentally
> > unsolvable given the current constraints.
> > 
> > Recognizing subvolume boundaries: we don't have anything for this yet,
> > and given what happened with btrfs I'm not sure we want to go that
> > route.
> 
> A fabricated fsid (I think that is what btrfs does) wouldn't help
> us either. The mentioned tools would just stop at the assumed mount
> points.

*nod* - nor for NFS.

> > We could (probably) expose unique inode numbers, filesystem wide, but at
> > a cost. We'd have to restrict ourselves internally to 32 bit inode
> > numbers, and then report inode numbers that include the subvolume ID as
> > the high 32 bits.
> > 
> > That won't leave enough bits to shard inode numbers at create time based
> > on CPU id, which isn't the _most_ important optimization, but would hurt
> > to lose.
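
To make the trade-off concrete, here is a minimal sketch of the packing
described above: subvolume ID in the high 32 bits, internal inode number
in the low 32 bits. The function names and the subvolume IDs 1 and 2 are
made up for illustration; this is not bcachefs code.

```python
# Sketch only: pack a 32-bit subvolume ID and a 32-bit internal inode
# number into one 64-bit reported inode number. Names are hypothetical.
FIELD_BITS = 32
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_ino(subvol_id: int, internal_ino: int) -> int:
    """Return the 64-bit number that would be reported to userspace."""
    for value in (subvol_id, internal_ino):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("value does not fit in 32 bits")
    return (subvol_id << FIELD_BITS) | internal_ino


def unpack_ino(reported_ino: int) -> tuple[int, int]:
    """Split a reported inode number back into (subvol_id, internal_ino)."""
    return reported_ino >> FIELD_BITS, reported_ino & FIELD_MASK


if __name__ == "__main__":
    # dir1 from the listing above, seen from two assumed subvolume IDs.
    original = pack_ino(1, 1342189197)
    snapshot = pack_ino(2, 1342189197)
    print(original != snapshot)  # the two copies no longer collide
```

All 32 low bits go to the inode number itself, which is why nothing is
left over for sharding by CPU id.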
> > 
> > Going forward, we really need - as you allude to - better userspace
> > APIs. Inode number probably can't be an integer anymore, it needs to be
> > a string if it's going to be able to do all the things we want it to do.
> 
> Should be a UUID then. Expose inode UUID (and/or snapshot UUID) via
> xattr? Anyway, I know people more competent than me are thinking
> about the problem.

Exposing the subvolume ID via an xattr would be easy, if that works for
you.
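
A rough sketch of how a tree walker might use such an xattr to stay
inside one subvolume. Nothing here exists yet: the attribute name is a
placeholder, and the sketch assumes every directory would report the ID
of the subvolume it lives in.

```python
import os

# Placeholder name: no xattr for the subvolume ID has been defined.
SUBVOL_XATTR = "user.example.subvolume_id"


def read_subvolume_id(path, name=SUBVOL_XATTR):
    """Return the raw xattr value, or None if it cannot be read."""
    getxattr = getattr(os, "getxattr", None)  # os.getxattr is Linux-only
    if getxattr is None:
        return None
    try:
        return getxattr(path, name, follow_symlinks=False)
    except OSError:
        return None


def walk_one_subvolume(top, get_id=read_subvolume_id):
    """Yield directories under top, skipping any whose ID differs from top's."""
    top_id = get_id(top)
    stack = [top]
    while stack:
        current = stack.pop()
        yield current
        with os.scandir(current) as entries:
            for entry in entries:
                if not entry.is_dir(follow_symlinks=False):
                    continue
                # A different ID means a subvolume or snapshot boundary.
                if get_id(entry.path) != top_id:
                    continue
                stack.append(entry.path)
```

If the attribute is missing everywhere, every ID reads as None and the
walk degrades to an ordinary recursive walk.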

> What do you think about the suggestion to make creation of snapshots
> (and possibly subvolumes, not sure about that) a privileged operation
> (optional, of course)? I know multi-user is niche nowadays, but to
> me it looks like a rather cheap workaround.

Yeah, that's not an unreasonable ask.
