On Feb 4, 2018, at 13:10, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:
> On Sat, Feb 3, 2018 at 4:45 AM, Dilger, Andreas <andreas.dil...@intel.com> 
> wrote:
>> On Jan 26, 2018, at 07:56, Thomas Roth <t.r...@gsi.de> wrote:
>> >
>> > Hmm, option-testing leads to more confusion:
>> >
>> > With this 922GB-sdb1 I do
>> >
>> > mkfs.lustre --reformat --mgs --mdt ... /dev/sdb1
>> >
>> > The output of the command says
>> >
>> >   Permanent disk data:
>> > Target:     test0:MDT0000
>> > ...
>> >
>> > device size = 944137MB
>> > formatting backing filesystem ldiskfs on /dev/sdb1
>> >       target name   test0:MDT0000
>> >       4k blocks     241699072
>> >       options        -J size=4096 -I 1024 -i 2560 -q -O 
>> > dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F
>> >
>> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000  -J size=4096 -I 1024 -i 
>> > 2560 -q -O 
>> > dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F /dev/sdb1 241699072
>> 
>> The default options have to be conservative, as we don't know in advance how 
>> a filesystem will be used.  It may be that some sites will have lots of hard 
>> links or long filenames (which consume directory space == blocks, but not 
>> inodes), or they will have widely-striped files (which also consume xattr 
>> blocks).  The 2KB/inode ratio includes the space for the inode itself (512B 
>> in 2.7.x, 1024B in 2.10), at least one directory entry (~64 bytes), some 
>> fixed overhead for the journal (up to 4GB on the MDT), and Lustre-internal 
>> overhead (OI entry = ~64 bytes), ChangeLog, etc.
>> 
>> If you have a better idea of space usage at your site, you can specify 
>> different parameters.
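[As a rough illustration of that ratio (a sketch only; the real mke2fs
accounting also subtracts journal, OI, and reserved blocks first, so the
actual inode counts come out lower):]

```python
# Rough estimate of how the "-i" (bytes-per-inode) ratio determines the
# inode count on this device; the real mke2fs figure is lower because
# journal, OI, and reserved blocks are subtracted first.
BLOCK_SIZE = 4096
BLOCKS = 241699072          # 4k block count from the mkfs.lustre output above

def estimated_inodes(bytes_per_inode):
    return BLOCKS * BLOCK_SIZE // bytes_per_inode

print(estimated_inodes(2560))  # ~387M upper bound, vs. the 369M reported
print(estimated_inodes(1536))  # ~645M upper bound, vs. the 615M reported
```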
>> 
>> > Mounting this as ldiskfs gives 369M inodes.
>> >
>> > One would assume that specifying one / some of the mke2fs-options here in 
>> > the mkfs.lustre-command will change nothing.
>> >
>> > However,
>> >
>> > mkfs.lustre --reformat --mgs --mdt ... --mkfsoptions="-I 1024" /dev/sdb1
>> >
>> > says
>> >
>> > device size = 944137MB
>> > formatting backing filesystem ldiskfs on /dev/sdb1
>> >       target name   test0:MDT0000
>> >       4k blocks     241699072
>> >       options       -I 1024 -J size=4096 -i 1536 -q -O 
>> > dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F
>> >
>> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000 -I 1024 -J size=4096 -i 1536 
>> > -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F /dev/sdb1 241699072
>> >
>> > and the mounted device now has 615M inodes.
>> >
>> > So whatever calculates the "-i" / bytes-per-inode value becomes 
>> > ineffective if I specify the inode size by hand?
>> 
>> This is a bit surprising.  I agree that specifying the same inode size value 
>> as the default should not affect the calculation for the bytes-per-inode 
>> ratio.
>> 
>> > How many bytes-per-inode do I need?
>> >
>> > This ratio, is it what the manual specifies as "one inode created for each 
>> > 2kB of LUN" ?
>> 
>> That was true with 512B inodes, but with the increase to 1024B inodes in 
>> 2.10 (to allow for PFL file layouts, since they are larger) the inode ratio 
>> has also gone up by 512B, from 2048B/inode to 2560B/inode.
> 
> Does this mean that someone who updates their servers from 2.x to 2.10 will 
> not be able to use PFL since the MDT was formatted in a way that can't 
> support it? (in our case formatted under Lustre 2.5 currently running 2.8)

It will be possible to use PFL layouts with older MDTs, but there may be a
performance impact if the MDTs are HDD-based: a multi-component PFL layout
is unlikely to fit into a 512-byte inode, so each PFL file will allocate an
extra xattr block.  On SSD-based MDTs the extra seek is unlikely to hurt
significantly, but on HDD-based MDTs this extra seek on every file access
will reduce metadata performance.

If you formatted the MDT filesystem for a larger default stripe count (e.g.
use "mkfs.lustre ... --stripe-count-hint=8" or more) then you will already
have 1024-byte inodes, and this is a non-issue.
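[The stripe-count hint matters because the default file-layout xattr has
to fit inside the inode.  A back-of-the-envelope sketch: the 32-byte
header and 24 bytes per stripe match the lov_mds_md_v1/lov_ost_data_v1
on-disk structures, but treat the exact reserved-space math in
mkfs.lustre as an assumption here, not what the code literally does:]

```python
# Rough size of a plain RAID-0 layout xattr ("lov" EA): a 32-byte
# lov_mds_md_v1 header plus one 24-byte lov_ost_data_v1 entry per stripe.
# mkfs.lustre uses the stripe-count hint to pick an inode size large
# enough that this EA (plus the other Lustre xattrs) stays inline.
def lov_ea_size(stripe_count):
    return 32 + 24 * stripe_count

for stripes in (1, 4, 8, 16):
    print(f"stripe count {stripes:2d}: layout EA ~{lov_ea_size(stripes)} bytes")
```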

That said, the overall impact on your applications may be minimal if you do
not have metadata-intensive workloads, and PFL can improve the IO performance
of applications, because many users do not set proper striping on their files.

Of course, if you know in advance what the best striping for a file is, and
your applications or users already use that, then PFL is not necessary and
there is no performance impact from skipping it.

Cheers, Andreas

>> > Perhaps the raw size of an MDT device should better be such that it leads
>> > to "-I 1024 -i 2048"?
>> 
>> Yes, that is probably reasonable, since the larger inode also means that 
>> there is less chance of external xattr blocks being allocated.
>> 
>> Note that with ZFS there is no need to specify the inode ratio at all.  It 
>> will dynamically allocate inode blocks as needed, along with directory 
>> blocks, OI tables, etc., until the filesystem is full.
>> 
>> Cheers, Andreas
>> 
>> > On 01/26/2018 03:10 PM, Thomas Roth wrote:
>> >> Hi all,
>> >> what is the relation between raw device size and size of a formatted MDT? 
>> >> Size of inodes + free space = raw size?
>> >> The example:
>> >> MDT device has 922 GB in /proc/partions.
>> >> Formatted under Lustre 2.5.3 with default values for mkfs.lustre resulted 
>> >> in a 'df -h' MDT of 692G and more importantly 462M inodes.
>> >> So, the space used for inodes + the 'df -h' output add up to the raw size:
>> >>  462M inodes * 0.5kB/inode + 692 GB = 922 GB
>> >> On that system there are now 330M files, more than 70% of the available 
>> >> inodes.
>> >> 'df -h' says '692G  191G  456G  30% /srv/mds0'
>> >> What do I need the remaining 450G for? (Or the ~400G left once all the 
>> >> inodes are eaten?)
>> >> Should the format command not be tuned towards more inodes?
>> >> Btw, on a Lustre 2.10.2 MDT I get 369M inodes and 550 G space (with a 
>> >> 922G raw device): inode size is now 1024.
>> >> However, according to the manual and various Jira/Ludocs the size should 
>> >> be 2k nowadays?
>> >> Actually, the command within mkfs.lustre reads
>> >> mke2fs -j -b 4096 -L test0:MDT0000  -J size=4096 -I 1024 -i 2560  -F 
>> >> /dev/sdb 241699072
>> >> -i 2560 ?
>> >> Cheers,
>> >> Thomas
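[For what it's worth, the "inodes + free space = raw size" accounting in
Thomas's original mail checks out approximately; a sketch, noting that
df -h reports GiB while /proc/partitions and the raw size use different
units, so ~1% slop is expected:]

```python
# Check the "462M inodes * 0.5kB/inode + 692 GB = 922 GB" accounting
# from the Lustre 2.5.3 example above.
GiB = 1024**3
inode_space = 462_000_000 * 512 / GiB   # space consumed by inode tables
total = inode_space + 692               # plus the 692G that df -h reports
print(f"{inode_space:.0f} GiB of inodes + 692 GiB = {total:.0f} GiB "
      f"(raw device: 922 GB)")
```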

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org