On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
> On 11/29/23 22:43, Kent Overstreet wrote:
> > On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> > > Hi,
> > >
> > > I'm happy, bcachefs finally made it to mainline, Kudos to Kent!
> > >
> > > As far as I know, there is an ongoing discussion about the problems of
> > > non-unique inode numbers exposed by snapshots an no real conclusion yet,
> > > on how filesystems should expose snapshots.
> > >
> > > The current behavior is kind of a showstopper for us, because we are
> > > still running multi-user systems. The problem is, that any unprivileged
> > > user can create subvolumes and snapshots:
> > >
> > > buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> > > buczek@dose:/scratch/local3$ mkdir vol1/dir1
> > > buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> > > buczek@dose:/scratch/local3$ ls -li vol1/
> > > total 0
> > > 1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > > 1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> > > buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> > > total 0
> > > 1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > > buczek@dose:/scratch/local3$ find .
> > > .
> > > ./vol1
> > > find: File system loop detected; ‘./vol1/snp1’ is part of the same
> > > file system loop as ‘./vol1’.
> > > ./vol1/dir1
> > > buczek@dose:/scratch/local3$
> > >
> > > We have a few tools which walk over the filesystem (backup, mirror,
> > > accounting) and these are just not prepared for non-unique inode numbers
> > > aside from regular hardlinks. I'm concerned that experiments of a
> > > unprivileged users could make important tools to fail.
> > >
> > > Questions:
> > >
> > > - Would it be a workaround to make creation of subvolumes and snapshots
> > >   privileged operations?
> > >
> > > - If we want to evolve our tools: What is the best way for userspace to
> > >   recognize subvolumes and snapshots and tell them apart from ordinary
> > >   directories?
> >
> > I'm considering writing an "integer identifiers considered harmful"
> > paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> > adding in the overlayfs issue makes this problem sound fundamentally
> > unsolvable given the current constraints.
> >
> > Recognizing subvolume boundaries: we don't have anything for this yet,
> > and given what happened with btrfs I'm not sure we want to go that
> > route.
>
> A fabricated fsid - I think, that is what btrfs does - wouldn't help
> us either. The mentioned tools would just stop at the assumed mount
> points.

*nod* - nor for NFS.

> > We could (probably) expose unique inode numbers, filesystem wide, but at
> > a cost. We'd have to restrict ourselves internally to 32 bit inode
> > numbers, and then report inode numbers that include the subvolume ID as
> > the high 32 bits.
> >
> > That won't leave enough bits to shard inode numbers at create time based
> > on CPU id, which isn't the _most_ important optimization, but would hurt
> > to lose.
> >
> > Going forward, we really need - as you allude to - better userspace
> > APIs. Inode number probably can't be an integer anymore, it needs to be
> > a string if it's going to be able to do all the things we want it to do.
>
> Should be an uuid then. Expose inode uuid (and/or snapshot uuid) via
> xattr? Anyway, I know, people more competent than me are thinking
> about the problem.

Exposing the subvolume ID via an xattr would be easy, if that works for you.

> What to you think about the suggestion to make creation of snapshots
> (and possible volumes, not sure about that) a privileged operation
> (optional, of course) ? I know, multi-user is niche nowadays, but for
> me it looks like a rather cheap workaround.

Yeah, that's not an unreasonable ask.
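
To make the trade-off quoted above concrete (subvolume ID in the high 32 bits, a 32 bit internal inode number in the low 32 bits), here is a rough Python sketch of the packing. It is only an illustration of the arithmetic, not bcachefs code: the names pack_ino and unpack_ino and the subvolume IDs 1 and 2 are made up for the example; the inode number is the one from the ls output in the thread.

```python
# Illustration only: hypothetical helpers, not part of bcachefs.
SUBVOL_BITS = 32
INODE_BITS = 32
INODE_MASK = (1 << INODE_BITS) - 1


def pack_ino(subvol_id: int, local_ino: int) -> int:
    """Build a 64-bit reported inode number: subvolume ID high, inode low."""
    if not 0 <= subvol_id < (1 << SUBVOL_BITS):
        raise ValueError("subvolume ID does not fit in 32 bits")
    if not 0 <= local_ino <= INODE_MASK:
        raise ValueError("internal inode number does not fit in 32 bits")
    return (subvol_id << INODE_BITS) | local_ino


def unpack_ino(reported: int) -> tuple:
    """Split a reported inode number back into (subvol_id, local_ino)."""
    return reported >> INODE_BITS, reported & INODE_MASK


# dir1 has the same internal inode number in vol1 and in its snapshot.
# The subvolume IDs 1 and 2 are assumed for the example.
in_vol1 = pack_ino(1, 1342189197)
in_snp1 = pack_ino(2, 1342189197)
print(in_vol1 != in_snp1)   # the reported numbers no longer collide
print(unpack_ino(in_snp1))  # the subvolume ID and inode are recoverable
```

With both 32 bit halves spoken for, every bit of the 64-bit value is used, which is why this layout leaves no room for sharding inode numbers by CPU id at create time.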
