On 26 Oct 2009, at 15:15, Rainer Toebbicke wrote:

What I forgot to mention was that during that test those zillion files are eventually removed. While unlinking, the dentry_cache shrinks, but to my surprise the afs_inode_cache doesn't.

We only actually release space in the afs_inode_cache when the kernel asks us to destroy the inode, and I think it will only do so when it believes it is running low on memory. 1.4.11 also contains changes (the dynamic inodes code) which mean that we try less hard to release inodes unless we are forced into doing so.

In terms of the buffer code, I spent a bit of a train journey today looking into it. The error message you're getting means that afs_newslot can't find a buffer that isn't in use to supply to the 'dir' package, which processes the directory structure. Buffers aren't marked in use beyond a single call into the directory package (so if a call returns while still holding a buffer, that's a bug). The Linux client has a relatively low number of buffers configured (50), and the directory code uses 2 buffers for some lookups, so this error would mean that you have 25 or more directory operations occurring simultaneously.

I find it hard to believe that it would be possible to get 25 processes all in the directory code at once (although, with a tree of large directories and a massively parallel writer, it's not impossible), so I started to look for unbalanced reference counts, or locking issues, in the buffer and directory code.

I found one locking issue, which was fixed back in 2002 in dir/buffer.c, but not in afs/afs_buffer.c. The fix for that is in gerrit as 737. However, I think I've convinced myself that the GLOCK serialises things sufficiently that this is purely a theoretical problem - I'd be surprised if you were seeing this in practice, and if you are, I think it would manifest itself in different ways.

The second issue that I found was with the way that newslot picks the oldest buffer to replace. There is an int32 counter, which is incremented each time a buffer is accessed, and the current value is stored within that buffer as its 'accesstime'. If a buffer has a stored accesstime of 0x7fffffff, then newslot will never evict that buffer. I can't, however, see a practical way in which you can get 50 buffers into this position. The fix for this is gerrit #738 ( http://gerrit.openafs.org/738 ). It might be worth giving this a whirl, and seeing if it helps.

All of that is a long-winded way of saying I don't really know what's causing your issue. One key diagnostic question is whether the cache manager continues to operate once it's run out of buffers. If we have a reference count imbalance somewhere, then the machine will never recover, and will report a lack of buffers for every operation it performs. If the cache manager does recover, then it may just mean that we need to look at either having a larger number of buffers, or making our buffer allocation dynamic. Both should be pretty straightforward, for Linux at least.

What happens to your clients once they've hit the error?

S.



_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
