Re: How to cope with subvolumes and snapshots on muti-user systems?

NeilBrown Thu, 30 Nov 2023 12:36:20 -0800

On Thu, 30 Nov 2023, Kent Overstreet wrote:
> On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> > Hi,
> > 
> > I'm happy, bcachefs finally made it to mainline, Kudos to Kent!
> > 
> > As far as I know, there is an ongoing discussion about the problems of
> > non-unique inode numbers exposed by snapshots, and no real conclusion yet
> > on how filesystems should expose snapshots.
> > 
> > The current behavior is kind of a showstopper for us, because we are still 
> > running multi-user systems. The problem is, that any unprivileged user can 
> > create subvolumes and snapshots:
> > 
> >     buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> >     buczek@dose:/scratch/local3$ mkdir vol1/dir1
> >     buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> >     buczek@dose:/scratch/local3$ ls -li vol1/
> >     total 0
> >     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >     1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> >     buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> >     total 0
> >     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >     buczek@dose:/scratch/local3$ find .
> >     .
> >     ./vol1
> >     find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
> >     ./vol1/dir1
> >     buczek@dose:/scratch/local3$
> > 
> > We have a few tools which walk over the filesystem (backup, mirror, 
> > accounting) and these are just not prepared for non-unique inode numbers 
> > aside from regular hardlinks. I'm concerned that experiments by
> > unprivileged users could make important tools fail.
> > 
> > Questions:
> > 
> > - Would it be a workaround to make creation of subvolumes and snapshots 
> > privileged operations?
> > 
> > - If we want to evolve our tools: What is the best way for userspace to 
> > recognize subvolumes and snapshots and tell them apart from ordinary 
> > directories?
> 
> I'm considering writing an "integer identifiers considered harmful"
> paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> adding in the overlayfs issue makes this problem sound fundamentally
> unsolvable given the current constraints.
> 
> Recognizing subvolume boundaries: we don't have anything for this yet,
> and given what happened with btrfs I'm not sure we want to go that
> route.
> 
> We could (probably) expose unique inode numbers, filesystem wide, but at
> a cost. We'd have to restrict ourselves internally to 32 bit inode
> numbers, and then report inode numbers that include the subvolume ID as
> the high 32 bits.
> 
> That won't leave enough bits to shard inode numbers at create time based
> on CPU id, which isn't the _most_ important optimization, but would hurt
> to lose.
> 
> Going forward, we really need - as you allude to - better userspace
> APIs. Inode number probably can't be an integer anymore, it needs to be
> a string if it's going to be able to do all the things we want it to do.
>
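
For reference, the packing Kent describes above (subvolume ID in the high
32 bits, a 32-bit per-subvolume inode number in the low 32 bits) could be
sketched roughly as follows.  The function names are made up for
illustration and are not anything bcachefs provides.

```python
# Hypothetical helpers illustrating the "subvolume ID in the high 32 bits"
# scheme; the names are assumptions, not bcachefs code.

def pack_ino(subvol_id: int, local_ino: int) -> int:
    """Combine a 32-bit subvolume ID and a 32-bit inode number into 64 bits."""
    if not 0 <= subvol_id < 2**32:
        raise ValueError("subvolume ID does not fit in 32 bits")
    if not 0 <= local_ino < 2**32:
        raise ValueError("inode number does not fit in 32 bits")
    return (subvol_id << 32) | local_ino


def unpack_ino(ino: int) -> tuple[int, int]:
    """Split a packed 64-bit number back into (subvol_id, local_ino)."""
    return ino >> 32, ino & 0xFFFFFFFF
```

The range checks are where the cost shows up: both halves are capped at
32 bits, so there is no room left for sharding by CPU id.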


If you want to give up on POSIX compatibility - and with it any hope
that your fs will be used - then using a string as the inode identifier
might work.  But in the real world, inodes have 64 bit numbers.

The only credible alternative is to use the fhandle as a primary
identifier.  You still need to support inode numbers, and they still
need to be stable and to be as unique as you can possibly make them
across the filesystem.

My preference would be to make the inode number reported to userspace be
a strong hash of the fhandle (which absolutely must be unique - but can be
quite large).  Then systemic clashes - like the one that results in find
misdetecting filesystem loops - will be extremely unlikely.
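
As a rough sketch of that idea, assuming the handle is available as an
opaque byte string: hash it and keep 64 bits.  BLAKE2b is just one
choice of strong hash, and inode_from_handle() is a made-up name, not an
existing kernel or libc interface.

```python
import hashlib

def inode_from_handle(handle: bytes) -> int:
    """Derive a 64-bit inode number from an opaque file handle.

    Assumption: any strong hash would do; BLAKE2b truncated to 8 bytes
    is used here only for illustration.
    """
    digest = hashlib.blake2b(handle, digest_size=8).digest()
    return int.from_bytes(digest, "little")
```

The same handle always maps to the same number, so the result is stable.
Two different handles can still collide, but only by chance, not
systematically the way a snapshot and its origin do today.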

For code that really really needs to detect if two files are the
same (tar being the obvious example) we should encourage the use of
name_to_handle_at().
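
A minimal sketch of such a comparison, calling name_to_handle_at(2)
through ctypes on Linux.  file_identity() and the FileHandle class are
names invented for this example; MAX_HANDLE_SZ and AT_FDCWD mirror the
values in the Linux headers.  Note that the mount ID identifies a mount
rather than a filesystem, so treating it as part of the identity is a
simplifying assumption.

```python
import ctypes
import os

MAX_HANDLE_SZ = 128   # same value as in <fcntl.h>
AT_FDCWD = -100       # "relative to the current directory"

class FileHandle(ctypes.Structure):
    # Mirrors struct file_handle, with room for the largest handle.
    _fields_ = [("handle_bytes", ctypes.c_uint),
                ("handle_type", ctypes.c_int),
                ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ)]

libc = ctypes.CDLL(None, use_errno=True)

def file_identity(path):
    """Return (mount_id, handle_type, handle_bytes) for path.

    Raises OSError if the filesystem cannot produce file handles.
    """
    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    rc = libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                                ctypes.byref(fh), ctypes.byref(mount_id), 0)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return (mount_id.value, fh.handle_type,
            bytes(fh.f_handle[:fh.handle_bytes]))
```

A tool like tar could then treat two paths as the same file when
file_identity() returns equal tuples, and fall back to st_dev/st_ino
where the call fails.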

Experience with NFSv2 and exclusive opens shows that if a filesystem
breaks something in a way that it works most of the time but
occasionally fails at random, then people will still use it
(provided it offers real benefits, of course) and they will find a
work-around to avoid the breakage.  So thanks to NFSv2, people stopped
using O_CREAT|O_EXCL as a locking mechanism, and did things with hard
links instead.

NeilBrown
