On 11/30/23 21:37, NeilBrown wrote:
On Thu, 30 Nov 2023, Kent Overstreet wrote:
On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
On 11/29/23 22:43, Kent Overstreet wrote:
On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
Hi,

I'm happy, bcachefs finally made it to mainline, Kudos to Kent!

As far as I know, there is an ongoing discussion about the problems of 
non-unique inode numbers exposed by snapshots an no real conclusion yet, on how 
filesystems should expose snapshots.

The current behavior is kind of a showstopper for us, because we are still 
running multi-user systems. The problem is, that any unprivileged user can 
create subvolumes and snapshots:

      buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
      buczek@dose:/scratch/local3$ mkdir vol1/dir1
      buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
      buczek@dose:/scratch/local3$ ls -li vol1/
      total 0
      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
      1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
      buczek@dose:/scratch/local3$ ls -li vol1/snp1/
      total 0
      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
      buczek@dose:/scratch/local3$ find .
      .
      ./vol1
      find: File system loop detected; ‘./vol1/snp1’ is part of the same file 
system loop as ‘./vol1’.
      ./vol1/dir1
      buczek@dose:/scratch/local3$

We have a few tools which walk over the filesystem (backup, mirror, accounting) 
and these are just not prepared for non-unique inode numbers aside from regular 
hardlinks. I'm concerned that experiments of a unprivileged users could make 
important tools to fail.

Questions:

- Would it be a workaround to make creation of subvolumes and snapshots 
privileged operations?

- If we want to evolve our tools: What is the best way for userspace to 
recognize subvolumes and snapshots and tell them apart from ordinary 
directories?

I'm considering writing an "integer identifiers considered harmful"
paper. I haven't dug into it yet - just skimmed the lwn coverage - but
adding in the overlayfs issue makes this problem sound fundamentally
unsolvable given the current constraints.

Recognizing subvolume boundaries: we don't have anything for this yet,
and given what happened with btrfs I'm not sure we want to go that
route.

A fabricated fsid - I think, that is what btrfs does - wouldn't help
us either. The mentioned tools would just stop at the assumed mount
points.

*nod* - nor for NFS.

We could (probably) expose unique inode numbers, filesystem wide, but at
a cost. We'd have to restrict ourselves internally to 32 bit inode
numbers, and then report inode numbers that include the subvolume ID as
the high 32 bits.

That won't leave enough bits to shard inode numbers at create time based
on CPU id, which isn't the _most_ important optimization, but would hurt
to lose.

Going forward, we really need - as you allude to - better userspace
APIs. Inode number probably can't be an integer anymore, it needs to be
a string if it's going to be able to do all the things we want it to do.

Should be an uuid then. Expose inode uuid (and/or snapshot uuid) via
xattr? Anyway, I know, people more competent than me are thinking
about the problem.

Exposing the subvolume ID via an xattr would be easy, if that works for
you.

I think that would be a bad choice.  Stick with the filehandle.  It is
an existing interface supported by many filesystems and it works over
NFS.


Hmm, isn't a (struct file_handle) filehandle supposed to be opaque?  I'm not 
sure if it is a good idea to stick filesystem-specific meta-information (like a 
subvolume ident) into special bits of opaque numbers (whether inode number or 
filehandle) at all.

The information could also be made available via specialized ioctl(s) as an 
alternative to xattr.

One disadvantage of embedded in filehandle, xattr or ioctl is that userspace 
would need an extra syscall for each directory just to see if this is a 
subvolume or a snapshot, which most of the time it is not. Maybe, even an 
otherwise unnecessary open is needed to get a consistent view between multiple 
information-retrieving syscalls. If only statx() was extendable.

You don't have this problem with a crafted fsid. Then creating a subvolume or a 
snapshot could be seen as an implied mount. Maybe that's not the worst solution.

Best

  Donald


NeilBrown



What to you think about the suggestion to make creation of snapshots
(and possible volumes, not sure about that) a privileged operation
(optional, of course) ? I know, multi-user is niche nowadays, but for
me it looks like a rather cheap workaround.

Yeah, that's not an unreasonable ask.



--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

Reply via email to