Hi,
We've been running Lustre 2.15.1 in production for over a year and recently
enabled PFL with DoM on our filesystem. Things were fine until last week,
when users started reporting "No space left on device" errors when copying
files. The MDT is running ldiskfs as the backend.
I searched the mailing list and found a couple of people reporting similar
problems, which prompted me to check the inode allocation. It is currently:
UUID                      Inodes      IUsed      IFree IUse% Mounted on
scratchc-MDT0000_UUID  624492544   71144384  553348160   12% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID   57712579   24489934   33222645   43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID   57114064   24505876   32608188   43% /mnt/scratchc[OST:1]
filesystem_summary:    136975217   71144384   65830833   52% /mnt/scratchc
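As a sanity check on the summary row, here is a minimal Python sketch using only the figures from the table above. It shows that the summary's free count equals the two OSTs' combined IFree, and that the summary total equals the MDT's IUsed plus that figure. Reading this as "the summary is bounded by free OST objects rather than by free MDT inodes" is my assumption, not something the output states.

```python
# Values copied from the inode table above.
mdt_used = 71_144_384
ost_free = [33_222_645, 32_608_188]

summary_total = 136_975_217
summary_free = 65_830_833

# Combined free objects across both OSTs.
ost_free_total = sum(ost_free)

print(ost_free_total == summary_free)               # True
print(mdt_used + ost_free_total == summary_total)   # True
print(round(100 * mdt_used / summary_total))        # 52
```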
So inode usage is nowhere near full. Disk usage is a little higher:
UUID                     bytes    Used Available Use% Mounted on
scratchc-MDT0000_UUID   882.1G  451.9G    355.8G  56% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID    53.6T   22.7T     31.0T  43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID    53.6T   23.0T     30.6T  43% /mnt/scratchc[OST:1]
filesystem_summary:     107.3T   45.7T     61.6T  43% /mnt/scratchc
But not full either! The errors are accompanied in the logs by:
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) scratchc-MDT0000: cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/ffff9e331e231c00 left 82586337280 < tot_grant 82586681321 unstable 0 pending 0 dirty 1044480
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 33050 previous similar messages
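To put numbers on that message, here is a minimal Python sketch that pulls the counters out of the log line and computes how far tot_grant exceeds left. The helper name parse_grant_fields is hypothetical, and I am assuming the values are byte counts; the sketch only does arithmetic on the logged figures and does not explain why the target reports them.

```python
import re

# Log line copied from above, joined onto one line.
LINE = (
    "LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) "
    "scratchc-MDT0000: cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/ffff9e331e231c00 "
    "left 82586337280 < tot_grant 82586681321 unstable 0 pending 0 dirty 1044480"
)


def parse_grant_fields(line):
    """Return the numeric counters from the message as a dict (hypothetical helper)."""
    pairs = re.findall(r"\b(left|tot_grant|unstable|pending|dirty) (\d+)", line)
    return {name: int(value) for name, value in pairs}


fields = parse_grant_fields(LINE)

# How much the total granted space exceeds the space reported as left.
shortfall = fields["tot_grant"] - fields["left"]
print(shortfall)                          # 344041

# "left" expressed in GiB, assuming the counter is in bytes.
print(round(fields["left"] / 2**30, 1))   # 76.9
```

If the byte assumption holds, the gap is only about 336 KiB, and "left" works out to roughly 76.9 GiB, which is well below the 355.8G that lfs df shows as available on the MDT. I don't know whether those two figures are meant to be comparable.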
For reference the DoM striping we're using is:
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count: 0 stripe_size: 1048576 pattern: mdt stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   1073741824
      stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   EOF
      stripe_count: -1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
So the first 1 MiB of each file is stored on the MDT.
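To make the layout concrete, here is a minimal Python sketch that maps a file offset to the component covering it. The extents and stripe settings come from the output above; the Component class and component_for_offset function are hypothetical names, and I am assuming each extent end is exclusive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    start: int            # lcme_extent.e_start, in bytes
    end: Optional[int]    # lcme_extent.e_end, in bytes; None stands for EOF
    pattern: str          # "mdt" or "raid0"
    stripe_count: int     # -1 is what the layout output shows for the last component


# The three components from the layout output above.
LAYOUT = [
    Component(0, 1048576, "mdt", 0),
    Component(1048576, 1073741824, "raid0", 1),
    Component(1073741824, None, "raid0", -1),
]


def component_for_offset(offset, layout=LAYOUT):
    """Return the component whose extent contains offset (end assumed exclusive)."""
    for comp in layout:
        if offset >= comp.start and (comp.end is None or offset < comp.end):
            return comp
    raise ValueError(f"no component covers offset {offset}")


print(component_for_offset(4096).pattern)        # mdt
print(component_for_offset(1048576).pattern)     # raid0
print(component_for_offset(2 * 1024**3).stripe_count)  # -1
```

Under that assumption, any file of 1 MiB or less lives entirely on the MDT, and larger files still keep their first 1 MiB there.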
My question, obviously, is what is causing these errors. I'm not massively
familiar with Lustre internals, so any pointers on where to look would be
greatly appreciated!
Cheers
Jon
Jon Marshall
High Performance Computing Specialist
IT and Scientific Computing Team
Cancer Research UK Cambridge Institute
Li Ka Shing Centre | Robinson Way | Cambridge | CB2 0RE