On 2011-01-28, at 10:45, Jason Rappleye wrote:
> Sometimes the performance drop is worse, and we see just tens of stats/second
> (or fewer!). This is because
> filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent
> directory of the object on the OST. Unlink or precreate operations whose
> lock-protected critical sections take a long time to complete will slow
> down stat requests. I'm working on tracking down the cause of this; it
> may be journal related. BZ 22107 is probably relevant as well.
There is work underway to allow ldiskfs directory locking to be
multi-threaded. This should significantly improve performance in such cases.
> Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST,
> and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups *
> 8 inode blocks/group), ~36% have at least one inode used. We pre-read those
> and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an
> average of 3891 directory blocks per OST.
>
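For anyone who wants to try the same trick, here is a rough sketch of such a
pre-read (untested; it parses dumpe2fs output, and the device name and exact
field patterns are assumptions to verify against your e2fsprogs):

#!/usr/bin/env python3
# Rough sketch: warm the buffer cache with only the in-use inode table
# blocks of an ext3/ext4 (ldiskfs) device. Parses dumpe2fs output, reads
# each group's inode bitmap, and pre-reads just those inode table blocks
# that contain at least one allocated inode.
import re, subprocess, sys

dev = sys.argv[1]  # e.g. /dev/sdb1 (placeholder)
out = subprocess.run(["dumpe2fs", dev], capture_output=True,
                     text=True, check=True).stdout

block_size = int(re.search(r"Block size:\s+(\d+)", out).group(1))
inode_size = int(re.search(r"Inode size:\s+(\d+)", out).group(1))
inodes_per_block = block_size // inode_size

bitmaps = [int(b) for b in re.findall(r"Inode bitmap at (\d+)", out)]
tables = [(int(a), int(b))
          for a, b in re.findall(r"Inode table at (\d+)-(\d+)", out)]

with open(dev, "rb", buffering=0) as f:
    for bitmap_blk, (tbl_start, tbl_end) in zip(bitmaps, tables):
        f.seek(bitmap_blk * block_size)
        bitmap = f.read(block_size)  # one inode bitmap per group
        for j in range(tbl_end - tbl_start + 1):
            # table block j holds inodes [j*ipb, (j+1)*ipb) of this group
            lo, hi = j * inodes_per_block, (j + 1) * inodes_per_block
            if any(bitmap[lo // 8 : (hi + 7) // 8]):  # any inode in use?
                f.seek((tbl_start + j) * block_size)
                f.read(block_size)  # read and discard: warms the cache

(The assumption being that reads through the block device land in the same
buffer cache that the filesystem's own metadata reads hit, so this only
helps while those blocks stay cached.)
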
> In the absence of controls on the size of the page cache, or enough RAM to
> cache all of the inode and directory blocks in memory, another potential
> solution is to place the metadata on an SSD. One can generate a dm linear
> target table that carves up an ext3/ext4 filesystem such that the inode
> blocks go on one device, and the data blocks go on another. Ideally the inode
> blocks would be placed on an SSD.
>
> I've tried this with both ext3 and ext4, using flex_bg on ext4 to reduce
> the size of the dm table. IIRC the overhead is acceptable in both cases:
> about 1us on average.
I'd be quite interested to see the results of such testing.
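For anyone who wants to reproduce the setup, here is a minimal sketch of
generating such a table (untested; /dev/hdd and /dev/ssd are placeholder
names, only the inode table extents are split out, and packing them
back-to-back on the SSD is just one possible layout):

#!/usr/bin/env python3
# Rough sketch: emit a device-mapper "linear" table that places every
# inode table extent of an ext3/ext4 filesystem on an SSD and leaves
# everything else on the original device, at its original offsets.
# Table line format: logical_start len "linear" device offset, all in
# 512-byte sectors.
import re, subprocess

hdd, ssd = "/dev/hdd", "/dev/ssd"  # placeholder device names
out = subprocess.run(["dumpe2fs", hdd], capture_output=True,
                     text=True, check=True).stdout

block_size = int(re.search(r"Block size:\s+(\d+)", out).group(1))
total_blocks = int(re.search(r"Block count:\s+(\d+)", out).group(1))
spb = block_size // 512  # 512-byte sectors per filesystem block

# Inode table extents in filesystem block numbers, sorted by start.
extents = sorted((int(a), int(b))
                 for a, b in re.findall(r"Inode table at (\d+)-(\d+)", out))

cursor, ssd_off = 0, 0
for start, end in extents:
    if cursor < start:  # data span stays on the HDD, in place
        print(f"{cursor * spb} {(start - cursor) * spb} linear {hdd} "
              f"{cursor * spb}")
    n = end - start + 1  # inode table span goes to the SSD, packed
    print(f"{start * spb} {n * spb} linear {ssd} {ssd_off}")
    ssd_off += n * spb
    cursor = end + 1
if cursor < total_blocks:  # trailing data span
    print(f"{cursor * spb} {(total_blocks - cursor) * spb} linear {hdd} "
          f"{cursor * spb}")

The output would be fed to something like "dmsetup create ost0_split" with
the filesystem unmounted, and the SSD has to be at least as large as the sum
of the inode table extents. With flex_bg the inode tables of a whole flex
group are contiguous, which is what shrinks the table.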
> Placing the inodes on separate storage is not sufficient, though. Slow
> directory block reads contribute to poor stat performance as well. Adding a
> feature to ext4 to reserve a number of fixed block groups for directory
blocks, and always allocating them there, would help. Those block groups
> could then be placed on an SSD as well.
I believe there is a heuristic that allocates directory blocks in the first
group of a flex_bg, so if that entire group is on SSD it would potentially
avoid this problem.
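One quick way to sanity-check that heuristic would be to dump a directory's
block list with debugfs and test whether each block falls in the first group
of its flex_bg (a rough sketch; the device and inode number are placeholders,
and it assumes dumpe2fs reports the flex_bg size as a group count):

#!/usr/bin/env python3
# Rough sketch: report, for each block of a directory inode, whether it
# sits in the first block group of its flex_bg (the part that would be
# placed on the SSD).
import re, subprocess

dev, ino = "/dev/hdd", 2  # placeholders; inode 2 is the root directory
hdr = subprocess.run(["dumpe2fs", "-h", dev], capture_output=True,
                     text=True, check=True).stdout
blocks_per_group = int(re.search(r"Blocks per group:\s+(\d+)", hdr).group(1))
flex_size = int(re.search(r"Flex block group size:\s+(\d+)", hdr).group(1))

blks = subprocess.run(["debugfs", "-R", f"blocks <{ino}>", dev],
                      capture_output=True, text=True).stdout.split()
for b in map(int, blks):
    group = b // blocks_per_group
    where = "SSD region" if group % flex_size == 0 else "HDD region"
    print(f"block {b}: group {group}, {where}")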
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.