Hi -
As Jeff mentions, the fileserver and the clients would be
generating a lot of RPC traffic:
   from the fileserver - BreakCallback
   from the clients - FetchData
if other users were trying to read files from the directory
that contained the files you were deleting.
There is a Meltdown script available on the IBM AFS
Support Web page. You can use it to monitor the fileserver threads
while this is going on, and you would probably see the number
of idle threads decreasing.
How many fileserver threads are you using, 128?
This Meltdown script is just a wrapper around the rxdebug
command, and we would normally let it run at 10-second
intervals. The syntax is
Meltdown.pl -s <server> -p <port> -t <seconds>
so for a fileserver:
Meltdown.pl -s <fileserver> -p 7000 -t 10
There is a link at the bottom of the following page where you
can download the Meltdown.pl Perl script. You may need to
edit the first line to point to your Perl location:
http://www-1.ibm.com/support/docview.wss?rs=0&q=%2bAFS+tool&uid=swg21112323&loc=en_US&cs=utf-8&cc=US%23%29=en
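
As a rough illustration of what a wrapper of this kind does, the sketch below polls rxdebug at a fixed interval and extracts two numbers from its output. It is not the Meltdown.pl script itself. The function names (parse_rxdebug, poll) are hypothetical, and the regular expressions assume that rxdebug prints lines of the form "N calls waiting for a thread" and "N threads are idle"; check the output of your own AFS version before relying on them.

```python
import re
import subprocess
import time

# Assumed shape of the relevant rxdebug output lines, e.g.
#   "0 calls waiting for a thread"
#   "11 threads are idle"
WAITING_RE = re.compile(r"(\d+) calls waiting for a thread")
IDLE_RE = re.compile(r"(\d+) threads are idle")


def parse_rxdebug(text):
    """Return (calls_waiting, idle_threads); None for a field not found."""
    waiting = WAITING_RE.search(text)
    idle = IDLE_RE.search(text)
    return (
        int(waiting.group(1)) if waiting else None,
        int(idle.group(1)) if idle else None,
    )


def poll(server, port=7000, interval=10, samples=6):
    """Run rxdebug every `interval` seconds and print one line per sample."""
    for _ in range(samples):
        # -noconns suppresses the per-connection listing.
        result = subprocess.run(
            ["rxdebug", server, str(port), "-noconns"],
            capture_output=True, text=True, check=False,
        )
        waiting, idle = parse_rxdebug(result.stdout)
        print(f"{time.strftime('%H:%M:%S')} waiting={waiting} idle={idle}")
        time.sleep(interval)
```

Calling poll("<fileserver>") during the delete would print one sample every 10 seconds. A falling idle count together with a rising count of waiting calls would suggest that the fileserver is running short of threads.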
Thanks
Todd
Jeffrey Altman <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED] 04/13/2006 04:08 PM
Robert Banz wrote:
>>
>> Could you do some rxdebug calls to the fileserver next time? So we
>> know why it's getting unresponsive.
>> It could be running out of threads. I don't expect that, but it could
>> be ...
>
> The 'symptoms' seem to be, for the most part, volume-specific. Slow
> response to accessing that volume, followed by the clients seeing a
> timeout on it. So, guts-o-the-fileserver folk, is there a volume-wide
> lock that gets set by a particular fileserver thread when it's being
> acted upon? Since deleting a whole-bunch-of-files (a /bin/rm -fr <dir>)
> is happening, that's a whole lot of requests coming in in-series to that
> volume, being taken care of on (probably) a first-come, first-served
> basis, leaving little room for other clients to get an op in on that
> volume?
Let's say that there are N clients who are all attempting to use
the contents of directory "D" in the read-write volume "V". Client 1
is making changes to the contents of the directory and clients 2 to N
are reading the directory.
In order for each of the N clients to read the directory, they need
to perform a FetchData RPC which registers a callback with the file
server on "D". Now each time that client 1 makes a change to the
contents of "D", each of the callbacks that are currently registered
with the file server must be broken in order to notify clients 2 to N
that the data they have cached is no longer valid. If clients 2 to N
are actively using "D", then when the callback break is received from
the file server they will in turn attempt to perform a new FetchData
operation to obtain the latest data value. This in turn registers a
new callback.
Now if client 1 is performing 30,000 individual RemoveFile RPCs, it is
going to be extremely hard for any of the other clients to
maintain a callback until all 30,000 RPCs are completed. As
soon as a FetchData operation completes, the callback will be broken
and the cache contents will be invalidated.
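
To get a feel for the volume of traffic this cycle produces, here is a toy model of it. It is only a sketch under a worst-case assumption: every reader holds a callback on "D" before each removal and fetches the directory again immediately after each break. The function name simulate and the counter names are made up for illustration, and a real file server may behave differently (for example, a reader that is slow to refetch would receive fewer breaks).

```python
def simulate(readers, removals):
    """Count RPCs when every reader refetches after each callback break.

    Worst-case assumption: each reader holds a callback on the directory
    before every removal and re-reads it immediately after the break.
    """
    # Readers currently holding a callback; each needed one initial FetchData.
    callbacks = set(range(readers))
    counts = {"RemoveFile": 0, "BreakCallback": 0, "FetchData": readers}

    for _ in range(removals):
        counts["RemoveFile"] += 1
        # The directory changed: break every registered callback.
        counts["BreakCallback"] += len(callbacks)
        broken, callbacks = callbacks, set()
        # Each notified reader fetches again, registering a new callback.
        for client in broken:
            counts["FetchData"] += 1
            callbacks.add(client)
    return counts
```

Under these assumptions, simulate(9, 30000) reports 270,000 callback breaks and 270,009 FetchData calls on top of the 30,000 RemoveFile RPCs, so the extra traffic grows with the number of active readers multiplied by the number of files removed.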
I don't believe there is a bug here; it's just a negative side effect
of client-side caching and the fact that the file system is only given
one file name at a time to act on.
Jeffrey Altman
#### smime.p7s has been removed from this note on April 14, 2006 by Todd DeSantis

