On 2011-01-28, at 10:45, Jason Rappleye wrote:
> Sometimes the performance drop is worse, and we see just tens of stats/second
> (or fewer!). This is because
> filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent
> directory of the object on the OST. Unlink or precreate operations whose
> lock-protected critical sections take a long time to complete will slow
> down stat requests. I'm working on tracking down the cause of this; it
> may be journal related. BZ 22107 is probably relevant as well.
There is work underway to allow ldiskfs directory locking to be
multi-threaded. This should significantly improve performance in such cases.
> Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST,
> and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups *
> 8 inode blocks/group), ~36% have at least one inode used. We pre-read those
> and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an
> average of 3891 directory blocks per OST.
>
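For anyone who wants to try the same trick, here is a rough sketch of such a
pre-read (untested; it parses dumpe2fs output, and the device name and exact
field patterns are assumptions to verify against your e2fsprogs):

#!/usr/bin/env python3
# Rough sketch: warm the buffer cache with only the in-use inode table
# blocks of an ext3/ext4 (ldiskfs) device. Parses dumpe2fs output, reads
# each group's inode bitmap, and pre-reads just those inode table blocks
# that contain at least one allocated inode.
import re, subprocess, sys

dev = sys.argv[1]  # e.g. /dev/sdb1 (placeholder)
out = subprocess.run(["dumpe2fs", dev], capture_output=True,
                     text=True, check=True).stdout

block_size = int(re.search(r"Block size:\s+(\d+)", out).group(1))
inode_size = int(re.search(r"Inode size:\s+(\d+)", out).group(1))
inodes_per_block = block_size // inode_size

bitmaps = [int(b) for b in re.findall(r"Inode bitmap at (\d+)", out)]
tables = [(int(a), int(b))
          for a, b in re.findall(r"Inode table at (\d+)-(\d+)", out)]

with open(dev, "rb", buffering=0) as f:
    for bitmap_blk, (tbl_start, tbl_end) in zip(bitmaps, tables):
        f.seek(bitmap_blk * block_size)
        bitmap = f.read(block_size)  # one inode bitmap per group
        for j in range(tbl_end - tbl_start + 1):
            # table block j holds inodes [j*ipb, (j+1)*ipb) of this group
            lo, hi = j * inodes_per_block, (j + 1) * inodes_per_block
            if any(bitmap[lo // 8 : (hi + 7) // 8]):  # any inode in use?
                f.seek((tbl_start + j) * block_size)
                f.read(block_size)  # read and discard: warms the cache

(The assumption being that reads through the block device land in the same
buffer cache that the filesystem's own metadata reads hit, so this only
helps while those blocks stay cached.)
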
> In the absence of controls on the size of the page cache, or enough RAM to
> cache all of the inode and directory blocks in memory, another potential
> solution is to place the metadata on an SSD. One can generate a dm linear
> target table that carves up an ext3/ext4 filesystem such that the inode
> blocks go on one device, and the data blocks go on another. Ideally the inode
> blocks would be placed on an SSD.
>
> I've tried this with both ext3 and ext4, using flex_bg on ext4 to reduce
> the size of the dm table. IIRC the overhead is acceptable in both cases:
> about 1us on average.
I'd be quite interested to see the results of such testing.
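For anyone who wants to reproduce the setup, here is a minimal sketch of
generating such a table (untested; /dev/hdd and /dev/ssd are placeholder
names, only the inode table extents are split out, and packing them
back-to-back on the SSD is just one possible layout):

#!/usr/bin/env python3
# Rough sketch: emit a device-mapper "linear" table that places every
# inode table extent of an ext3/ext4 filesystem on an SSD and leaves
# everything else on the original device, at its original offsets.
# Table line format: logical_start len "linear" device offset, all in
# 512-byte sectors.
import re, subprocess

hdd, ssd = "/dev/hdd", "/dev/ssd"  # placeholder device names
out = subprocess.run(["dumpe2fs", hdd], capture_output=True,
                     text=True, check=True).stdout

block_size = int(re.search(r"Block size:\s+(\d+)", out).group(1))
total_blocks = int(re.search(r"Block count:\s+(\d+)", out).group(1))
spb = block_size // 512  # 512-byte sectors per filesystem block

# Inode table extents in filesystem block numbers, sorted by start.
extents = sorted((int(a), int(b))
                 for a, b in re.findall(r"Inode table at (\d+)-(\d+)", out))

cursor, ssd_off = 0, 0
for start, end in extents:
    if cursor < start:  # data span stays on the HDD, in place
        print(f"{cursor * spb} {(start - cursor) * spb} linear {hdd} "
              f"{cursor * spb}")
    n = end - start + 1  # inode table span goes to the SSD, packed
    print(f"{start * spb} {n * spb} linear {ssd} {ssd_off}")
    ssd_off += n * spb
    cursor = end + 1
if cursor < total_blocks:  # trailing data span
    print(f"{cursor * spb} {(total_blocks - cursor) * spb} linear {hdd} "
          f"{cursor * spb}")

The output would be fed to something like "dmsetup create ost0_split" with
the filesystem unmounted, and the SSD has to be at least as large as the sum
of the inode table extents. With flex_bg the inode tables of a whole flex
group are contiguous, which is what shrinks the table.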
> Placing the inodes on separate storage is not sufficient, though. Slow
> directory block reads contribute to poor stat performance as well. Adding a
> feature to ext4 to reserve a number of fixed block groups for directory
blocks, and always allocating them there, would help. Those block groups
> could then be placed on an SSD as well.
I believe there is a heuristic that allocates directory blocks in the first
group of a flex_bg, so if that entire group is on SSD it would potentially
avoid this problem.
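One quick way to sanity-check that heuristic would be to dump a directory's
block list with debugfs and test whether each block falls in the first group
of its flex_bg (a rough sketch; the device and inode number are placeholders,
and it assumes dumpe2fs reports the flex_bg size as a group count):

#!/usr/bin/env python3
# Rough sketch: report, for each block of a directory inode, whether it
# sits in the first block group of its flex_bg (the part that would be
# placed on the SSD).
import re, subprocess

dev, ino = "/dev/hdd", 2  # placeholders; inode 2 is the root directory
hdr = subprocess.run(["dumpe2fs", "-h", dev], capture_output=True,
                     text=True, check=True).stdout
blocks_per_group = int(re.search(r"Blocks per group:\s+(\d+)", hdr).group(1))
flex_size = int(re.search(r"Flex block group size:\s+(\d+)", hdr).group(1))

blks = subprocess.run(["debugfs", "-R", f"blocks <{ino}>", dev],
                      capture_output=True, text=True).stdout.split()
for b in map(int, blks):
    group = b // blocks_per_group
    where = "SSD region" if group % flex_size == 0 else "HDD region"
    print(f"block {b}: group {group}, {where}")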
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.