On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:

> On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>>> It would probably be better to set:
>>> 
>>> lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>> 
>>> or similar, to limit the read cache to files 32MB in size or less (or 
>>> whatever you consider "small" files at your site). That allows the read 
>>> cache for config files and such, without thrashing the cache when 
>>> accessing large files.
>>> 
>>> We should probably change this to be the default, but at the time the read 
>>> cache was introduced we didn't know what should be considered a small vs. 
>>> large file, and the amount of RAM, the number of OSTs per OSS, and the 
>>> usage patterns vary so much that it is difficult to pick a single correct 
>>> value for this.
> 
> Limiting the total amount of OSS cache used, in order to leave room for
> inodes/dentries, might be more useful. The data cache will always fill
> up and push out inodes otherwise.

The inode and dentry objects in the slab cache aren't so much of an issue as 
having the disk blocks they are generated from available in the buffer cache. 
Constructing the in-memory inode and dentry objects is cheap as long as the 
corresponding disk blocks are cached. Doing the disk reads, depending on your 
hardware and some other factors, is not.

> Nathan's approach of turning off the caches entirely is extreme, but if
> it gives us back some metadata performance then it might be worth it.

We went to the extreme and disabled the OSS read cache (+ writethrough cache). In 
addition, on the OSSes we pre-read all of the inode blocks that contain at 
least one used inode, along with all of the directory blocks. 
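
For reference, the knobs involved are the obdfilter cache tunables; a sketch 
(parameter names as in Lustre 1.8 - check `lctl get_param obdfilter.*.*` on 
your release):

```shell
# Disable the OSS read and writethrough caches on every OST served by
# this OSS. set_param takes effect immediately but does not persist
# across a remount; use "lctl conf_param" on the MGS for that.
disable_oss_caches() {
    lctl set_param 'obdfilter.*.read_cache_enable=0'
    lctl set_param 'obdfilter.*.writethrough_cache_enable=0'
}
```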

The results have been promising so far. Firing off a du on an entire 
filesystem, we typically see 3000-6000 stats/second. I've noted a few causes 
of slowdowns so far; there may be more.

First, no attempt has been made to pre-read metadata from the MDT. The need to 
read in inode and directory blocks may slow things down quite a bit. I can't 
find the numbers in my notes at the moment, but I recall seeing 200-500 
stats/second when the MDS needed to do I/O.

When memory runs low on a client, kswapd kicks in to try to free up pages. On 
the client I'm currently testing on, almost all of the memory used is in the 
slab. kswapd appears to have a difficult time reclaiming it, and the client 
can stall for several seconds before the current stat call completes. 
Dropping caches will (temporarily) restore the expected rates. I haven't dug 
into this one too much yet.
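
The drop itself is just the standard Linux VM interface; a minimal helper, 
for reference:

```shell
# Reclaim clean caches on the client (root required). The mode value
# selects what is dropped: 1 = pagecache, 2 = dentries and inodes,
# 3 = both. Dirty data is unaffected, hence the sync first.
drop_caches() {
    mode="${1:-3}"
    sync
    echo "$mode" > /proc/sys/vm/drop_caches
}
# e.g. on the client: drop_caches 3
```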

Sometimes the performance drop is worse, and we see just tens of stats/second 
(or fewer!). This is because filter_{fid2dentry,precreate,destroy} all need 
to take a lock on the parent directory of the object on the OST. Unlink or 
precreate operations whose critical sections, protected by this lock, take a 
long time to complete will slow down stat requests. I'm working on tracking 
down the cause of this; it may be journal related. BZ 22107 is probably 
relevant as well.
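
For anyone who wants to watch this on their own servers, a hypothetical 
SystemTap sketch of the kind of probe I'm using (the module and function 
names are from Lustre 1.8 and must match your build; requires stap and 
kernel/module debuginfo):

```shell
# Print a log-scale histogram of time spent in filter_fid2dentry on
# the OSS. Long tails here line up with the stat slowdowns above.
trace_fid2dentry() {
    stap -e '
      global t, lat
      probe module("obdfilter").function("filter_fid2dentry") {
          t[tid()] = gettimeofday_us()
      }
      probe module("obdfilter").function("filter_fid2dentry").return {
          if (t[tid()]) {
              lat <<< gettimeofday_us() - t[tid()]
              delete t[tid()]
          }
      }
      probe end { print(@hist_log(lat)) }
    '
}
```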

> or is there a Lustre or VM setting to limit overall OSS cache size?

No, but I think that would be really useful in this situation.

> I presume that Lustre's OSS caches are subject to normal Linux VM
> pagecache tweakables, but I don't think such a knob exists in Linux at
> the moment...

Correct on both counts. A patch was proposed to do this, but I don't see any 
evidence of it making it into the kernel:

http://lwn.net/Articles/218890/

I have a small set of perl, bash, and SystemTap scripts to read the inode and 
directory blocks from disk and monitor the performance of the relevant Lustre 
calls on the servers. I'll clean them up and send them to the list next week. A 
more elegant solution would be to get e2scan to do the job, but I haven't taken 
a hack at that yet.
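
Until then, a rough sketch of the inode-table pre-read (assuming an 
ext3/ext4 backing device and dumpe2fs from e2fsprogs; the bitmap check that 
skips empty inode blocks is omitted here):

```shell
# Walk the dumpe2fs group descriptors and turn lines like
#   "  Inode table at 1027-1538 (+1025)"
# into "start count" pairs suitable for feeding to dd.
parse_inode_tables() {
    awk '/Inode table at/ {
        split($4, r, "-")
        start = r[1]
        end = (r[2] == "" ? r[1] : r[2])
        print start, end - start + 1
    }'
}

# Usage on the OSS (as root, 4 KiB blocks assumed):
#   dumpe2fs /dev/ost_dev | parse_inode_tables | \
#   while read start count; do
#       dd if=/dev/ost_dev of=/dev/null bs=4096 skip="$start" count="$count"
#   done
```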

Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, and 
15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 8 
inode blocks/group), ~36% have at least one inode used. We pre-read those and 
ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an 
average of 3891 directory blocks per OST.
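
A quick back-of-the-envelope check, assuming 4 KiB blocks, shows the per-OST 
pre-read working set is modest:

```shell
# Inode-block accounting for one OST, using the figures above.
groups=58800                        # block groups on the OST
per_group=8                         # inode table blocks per group
total=$((groups * per_group))       # 470400 inode blocks on disk
used=$((total * 36 / 100))          # ~36% contain at least one used inode
echo "inode blocks on disk: $total"
echo "blocks to pre-read:   $used"
echo "pre-read size (MiB):  $((used * 4 / 1024))"   # 4 KiB per block
```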

In the absence of controls on the size of the page cache, or enough RAM to 
cache all of the inode and directory blocks in memory, another potential 
solution is to place the metadata on an SSD. One can generate a dm linear 
target table that carves up an ext3/ext4 filesystem such that the inode blocks 
go on one device, and the data blocks go on another. Ideally the inode blocks 
would be placed on an SSD. 
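
A minimal illustration of such a table (device names and sector counts are 
made up; a real table is generated from the filesystem layout and alternates 
between many extents):

```
# dm linear table, one mapping per line:
#   <start> <length> linear <device> <offset>   (units: 512-byte sectors)
# The first extent, holding inode tables, lands on the SSD; the
# remainder of the OST maps to the HDD.
0     8192    linear /dev/ssd0 0
8192  999424  linear /dev/sdb1 8192
```

Loaded with something like `dmsetup create ost0-meta meta-split.table`.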

I've tried this with both ext3 and ext4, using flex_bg on ext4 to reduce the 
size of the dm table. IIRC the overhead is acceptable in both cases - 1us, on 
average.

Placing the inodes on separate storage is not sufficient, though. Slow 
directory block reads contribute to poor stat performance as well. Adding a 
feature to ext4 to reserve a number of fixed block groups for directory blocks, 
and always allocating them there, would help. Those block groups could then be 
placed on an SSD as well.

Even with the inode and directory blocks on fast storage, stat performance will 
still suffer when other operations that require a lock on the object's parent 
directory are slow.

I've left out a few details and actual performance numbers from our production 
systems. I'll do a more detailed writeup after I take care of some other things 
at work, and finish recovering from 13.5 timezones worth of jet lag :-) 

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035





_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
