On Sat, 09 Dec 2023, Kent Overstreet wrote:
> On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > On 12/8/23 03:49, Kent Overstreet wrote:
> > 
> > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > bits for actually allocating inodes out of.
> > > 
> > > That'll be enough for the vast, vast majority of users, but exceeding
> > > that limit is already something we're technically capable of: we're
> > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > as fsck gets more optimized and online fsck comes.
> > 
> > 30 bits would not be enough even today:
> > 
> > buczek@done:~$ df -i /amd/done/C/C8024
> > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > 
> > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).

Only 30 bits used, though, so there is still a long way to go before you
use all 32 bits.
How many volumes do you have?

> > 
> > And if the idea to produce unique inode numbers by hashing the filehandle 
> > into 64 bits is followed, collisions definitely need to be addressed. With 
> > 618857441 objects, the probability of a hash collision with 64 bit is 
> > already over 1% [1].
> 
> Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> a unique identifier; time to start looking at how to extend statx.
> 

64 should be plenty...

If you have 32 bits for free allocation, and 7 bits for sharding across
128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
not the full ~550 billion that 39 bits allow, but if you actually spread
the load over all the shards, then certainly tens of billions.

If you use 22 bits for the volume number and 42 bits for inodes in a
volume, then you can spend 7 of the inode bits on sharding and each CPU
still has room to allocate 55 filesystems' worth of Donald's inodes.

And if Donald only needs thousands of volumes, not millions, then he
could configure for a whole lot more headroom.

In fact, if you use the 64 bits of vfs_inode number by filling in bits from
the fs-inode number from one end, and bits from the volume number from
the other end, then you don't need to pre-configure how the 64 bits are
shared.
You record inum-bits and volnum-bits in the filesystem metadata, and
increase either as needed.  Once the sum hits 64, you start returning
ENOSPC for new files or new volumes.

There will come a day when 64 bits is not enough for inodes in a single
filesystem.  Today is not that day.

NeilBrown
