On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> Hi,
>
> I'm happy, bcachefs finally made it to mainline, Kudos to Kent!
>
> As far as I know, there is an ongoing discussion about the problems of
> non-unique inode numbers exposed by snapshots an no real conclusion yet, on
> how filesystems should expose snapshots.
>
> The current behavior is kind of a showstopper for us, because we are still
> running multi-user systems. The problem is, that any unprivileged user can
> create subvolumes and snapshots:
>
> buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> buczek@dose:/scratch/local3$ mkdir vol1/dir1
> buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> buczek@dose:/scratch/local3$ ls -li vol1/
> total 0
> 1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> 1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> total 0
> 1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> buczek@dose:/scratch/local3$ find .
> .
> ./vol1
> find: File system loop detected; ‘./vol1/snp1’ is part of the same file
> system loop as ‘./vol1’.
> ./vol1/dir1
> buczek@dose:/scratch/local3$
>
> We have a few tools which walk over the filesystem (backup, mirror,
> accounting) and these are just not prepared for non-unique inode numbers
> aside from regular hardlinks. I'm concerned that experiments of a
> unprivileged users could make important tools to fail.
>
> Questions:
>
> - Would it be a workaround to make creation of subvolumes and snapshots
> privileged operations?
>
> - If we want to evolve our tools: What is the best way for userspace to
> recognize subvolumes and snapshots and tell them apart from ordinary
> directories?

I'm considering writing an "integer identifiers considered harmful"
paper.

I haven't dug into it yet - just skimmed the LWN coverage - but adding
in the overlayfs issue makes this problem sound fundamentally unsolvable
given the current constraints.

Recognizing subvolume boundaries: we don't have anything for this yet,
and given what happened with btrfs I'm not sure we want to go that
route.

We could (probably) expose unique inode numbers, filesystem wide, but at
a cost. We'd have to restrict ourselves internally to 32-bit inode
numbers, and then report inode numbers that include the subvolume ID as
the high 32 bits. That won't leave enough bits to shard inode numbers at
create time based on CPU ID, which isn't the _most_ important
optimization, but would hurt to lose.

Going forward, we really need - as you allude to - better userspace
APIs. The inode number probably can't be an integer anymore; it needs to
be a string if it's going to be able to do all the things we want it to
do.
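
To make the packing scheme above concrete, here is a minimal sketch in
Python of what "subvolume ID as the high 32 bits" would mean for the
reported 64-bit inode number. This is illustration only, not bcachefs
code: the names pack_ino and unpack_ino and the subvolume IDs 1 and 2
are made up, and only the inode number 1342189197 comes from the listing
quoted above.

```python
# Hypothetical sketch: report a 64-bit inode number built from a
# 32-bit subvolume ID (high half) and a 32-bit internal inode number
# (low half). None of these names exist in bcachefs.

SUBVOL_BITS = 32
INODE_BITS = 32
INODE_MASK = (1 << INODE_BITS) - 1


def pack_ino(subvol_id: int, ino: int) -> int:
    """Combine a subvolume ID and an internal inode number into one value."""
    if not 0 <= subvol_id < (1 << SUBVOL_BITS):
        raise ValueError("subvolume ID does not fit in 32 bits")
    if not 0 <= ino <= INODE_MASK:
        # This is the cost: internal inode numbers are capped at 32 bits,
        # so no bits remain for sharding by CPU at create time.
        raise ValueError("inode number does not fit in 32 bits")
    return (subvol_id << INODE_BITS) | ino


def unpack_ino(reported: int) -> tuple[int, int]:
    """Split a reported inode number back into (subvol_id, ino)."""
    return reported >> INODE_BITS, reported & INODE_MASK


if __name__ == "__main__":
    # dir1 has the same internal inode number in vol1 and in its snapshot;
    # assumed subvolume IDs 1 and 2 make the reported numbers differ.
    in_vol = pack_ino(1, 1342189197)
    in_snap = pack_ino(2, 1342189197)
    print(in_vol != in_snap)
    print(unpack_ino(in_snap))
```

With this layout a tree walker comparing reported inode numbers would no
longer see dir1 in the subvolume and dir1 in the snapshot as the same
object, which is the failure find hit above. The range check in pack_ino
is where the tradeoff shows up: every bit handed to the subvolume ID is a
bit the allocator can no longer use.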
