On 12/11/23 23:43, NeilBrown wrote:
> On Sat, 09 Dec 2023, Kent Overstreet wrote:
>> On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
>>> On 12/8/23 03:49, Kent Overstreet wrote:
>>>
>>>> We really only need 6 or 7 bits out of the inode number for sharding;
>>>> then 20-32 bits (nobody's going to have a billion snapshots; a million
>>>> is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
>>>> bits for actually allocating inodes out of.
>>>>
>>>> That'll be enough for the vast, vast majority of users, but exceeding
>>>> that limit is already something we're technically capable of: we're
>>>> currently seeing filesystems well over 100 TB, petabyte range expected
>>>> as fsck gets more optimized and online fsck comes.
>>>
>>> 30 bits would not be enough even today:
>>>
>>> buczek@done:~$ df -i /amd/done/C/C8024
>>> Filesystem         Inodes     IUsed      IFree IUse% Mounted on
>>> /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
>>>
>>> So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> 
> only 30 bits though.  So it is a long way before you use all 32 bits.
> How many volumes do you have?

Sorry, I didn't answer that, and maybe the question is obsolete.
Technically you are correct, of course, though the "long way" is only a
factor of 4. I just wanted to show that live filesystems are on the
order of magnitude of 2^32 inodes.
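The figures above are easy to verify; a quick check (plain Python, nothing filesystem-specific):

```python
# 618857441 inodes from the df output above: fits in 30 bits,
# and matches the hex value quoted earlier.
n = 618857441
print(n.bit_length())  # 30
print(hex(n))          # 0x24e303e1
```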

I'm not sure I understand the question about how many volumes, or why it
matters. Sure, if we combined the filesystems into one, we would be way
over 2^32 files.

Subvolumes: none.

Filesystems: three 100TB filesystems like this on that particular server
and three additional 100TB filesystems on another server for the same
application. A 300TB filesystem on another server for a similar
application (but with far fewer inodes currently).

We currently have about 112 filesystems >= 100TB online, if I counted
correctly, and many, many more smaller ones (30TB, 40TB, 50TB).

Best
  Donald


>>> And if the idea to produce unique inode numbers by hashing the filehandle 
>>> into 64 is followed, collisions definitely need to be addressed. With 
>>> 618857441 objects, the probability of a hash collision with 64 bit is 
>>> already over 1% [1].
>>
>> Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
>> a unique identifier; time to start looking at how to extend statx.
>>
> 
> 64 should be plenty...
> 
> If you have 32 bits for free allocation, and 7 bits for sharding across
> 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> not the full 500 billion for 39 bits, but if you actually spread the
> load over all the shards, then certainly tens of billions.
> 
> If you use 22 bits for volume number and 42 bits for inodes in a volume,
> then you can spend 7 on sharding and still have room for 55 of Donald's
> filesystems to be allocated by each CPU.
> 
> And if Donald only needs thousands of volumes, not millions, then he
> could configure for a whole lot more headroom.
> 
> In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> the fs-inode number from one end, and bits from the volume number from
> the other end, then you don't need to pre-configure how the 64 bits are
> shared.
> You record inum-bits and volnum bits in the filesystem metadata, and
> increase either as needed.  Once the sum hits 64, you start returning
> ENOSPC for new files or new volumes.
> 
> There will come a day when 64 bits is not enough for inodes in a single
> filesystem.  Today is not that day.
> 
> NeilBrown
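For what it's worth, the two-ended packing described above is easy to
sketch; the names here (InoSpace, grow_*, pack) are purely illustrative,
not bcachefs code:

```python
# Sketch: inode-number bits grow from the low end of the 64-bit value,
# volume-number bits from the high end. Either width can be increased
# as needed until the two meet, at which point allocation fails.

class InoSpace:
    def __init__(self):
        self.inum_bits = 0    # bits reserved for per-volume inode numbers
        self.volnum_bits = 0  # bits reserved for volume numbers

    def grow_inum(self):
        if self.inum_bits + self.volnum_bits >= 64:
            raise OSError("ENOSPC: no bits left for larger inode numbers")
        self.inum_bits += 1

    def grow_volnum(self):
        if self.inum_bits + self.volnum_bits >= 64:
            raise OSError("ENOSPC: no bits left for more volumes")
        self.volnum_bits += 1

    def pack(self, volnum, inum):
        assert inum < (1 << self.inum_bits)
        assert volnum < (1 << self.volnum_bits)
        # volume number in the top volnum_bits, inode number in the bottom
        return (volnum << (64 - self.volnum_bits)) | inum
```

With, say, 42 inode bits and 22 volume bits already in use, the sum is
64 and any further growth on either side fails, exactly as described.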

-- 
Donald Buczek
[email protected]
Tel: +49 30 8413 1433
