On 26 Oct 2009, at 15:15, Rainer Toebbicke wrote:

What I forgot to mention was that during that test those zillion files are eventually removed. While unlinking, the dentry_cache shrinks, but to my surprise the afs_inode_cache doesn't.

We only actually release space in the afs_inode_cache when the kernel asks us to destroy the inode, and I think it will only do so when it believes it is running low on memory. 1.4.11 also contains changes (the dynamic inodes code) which mean that we try less hard to release inodes unless we are forced into doing so.

In terms of the buffer code, I spent a bit of a train journey today looking into it. The error message you're getting means that afs_newslot can't find a buffer that isn't in use to supply to the 'dir' package, which processes the directory structure. Buffers aren't marked in use beyond a single call into the directory package (so if a call returns while still holding a buffer, that's a bug). The Linux client has a relatively low number of buffers configured (50), and the directory code uses 2 buffers for some lookups, so this error would mean that you have 25 or more directory operations occurring simultaneously.

I find it hard to believe that it would be possible to get 25 processes all in the directory code at once (although, with a tree of large directories and a massively parallel writer, it's not impossible), so I started to look for unbalanced reference counts, or locking issues, in the buffer and directory code.

I found one locking issue, which was fixed back in 2002 in dir/buffer.c, but not in afs/afs_buffer.c. The fix for that is in gerrit as 737. However, I think I've convinced myself that the GLOCK serialises things sufficiently that this is purely a theoretical problem - I'd be surprised if you were seeing this in practice, and if you are, I think it would manifest itself in different ways.

The second issue that I found was with the way that newslot picks the oldest buffer to replace. There is an int32 counter, which is incremented each time a buffer is accessed, and the current value is stored within that buffer as its 'accesstime'. If a buffer has a stored accesstime of 0x7fffffff, then newslot will never evict that buffer. I can't, however, see a practical way in which you can get 50 buffers into this position. The fix for this is gerrit #738 ( http://gerrit.openafs.org/738 ). It might be worth giving this a whirl, and seeing if it helps.

All of that is a long-winded way of saying I don't really know what's causing your issue. One key diagnostic question is whether the cache manager continues to operate once it's run out of buffers. If we have a reference count imbalance somewhere, then the machine will never recover, and will report a lack of buffers for every operation it performs. If the cache manager does recover, then it may just mean that we need to look at either having a larger number of buffers, or making our buffer allocation dynamic. Both should be pretty straightforward, for Linux at least.

What happens to your clients once they've hit the error?

S.



_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
