The cache manager times out RPCs to the file server after 50 seconds.
There is a thread which pings all the file servers (well, all those
which have callbacks outstanding) every 10 minutes, by performing a
GetTime RPC.
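A rough sketch of what one round of that ping thread amounts to is
below. This is only an illustration, not the cache manager's code: the
names servers_to_ping, ping_round and get_time are made up here, and
treating a negative return code as a failed probe is an assumption.

```python
PING_INTERVAL_SECONDS = 10 * 60  # the thread wakes up every 10 minutes


def servers_to_ping(callback_counts):
    """Names of the servers that have at least one outstanding callback.

    callback_counts is a hypothetical dict: server name -> callback count.
    """
    return sorted(name for name, count in callback_counts.items() if count > 0)


def ping_round(callback_counts, get_time):
    """Probe each eligible server once and return the ones that failed.

    get_time is a hypothetical callable standing in for the GetTime RPC;
    it takes a server name and returns the RPC return code.
    """
    return [name for name in servers_to_ping(callback_counts)
            if get_time(name) < 0]
```

A real thread would call something like ping_round() and then sleep for
PING_INTERVAL_SECONDS before the next pass.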
Which version of AFS are you porting? Probably 3.3. Here's how the
RPC works (basically). Ideally, there is a single packet from the
client to the server, followed by a single packet response. The bulk
transfer RPCs obviously behave slightly differently. If the server
does not respond within a short period of time (on the order of a
second -- I can't be more precise because this is adaptive), the
client will resend any unacknowledged packets. If a resent packet is
still not acknowledged, the client will retransmit it after about 2
seconds, and again every 2 seconds after that, until a response is
received or 50 seconds pass without any response. It is very unlikely
that none of these packets will be successfully transmitted, no
matter how bad the Linux networking implementation is. One minor
optimization made in the AFS kernel RPC is to detect a non-zero
return code from the UDP send call (signifying that the packet was
not actually sent), and to schedule the packet to be retransmitted
nearly immediately, instead of waiting for the timer to expire. You
should verify that this is happening with Linux. Note that nearly all
network interfaces have sufficient queuing that this should rarely be
an issue -- it seems to be most prevalent on pmaxes for some reason I
haven't yet uncovered.
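To make the timing concrete, here is a small simulation of when a
single request packet would be handed to the UDP send call if the
server never answers. It is a sketch under stated assumptions, not the
real RPC code: the first timeout is adaptive in practice (fixed at one
second here), the 10 ms delay is just a stand-in for "nearly
immediately", and the function and constant names are invented.

```python
from itertools import chain, repeat

FIRST_TIMEOUT_MS = 1000    # "on the order of a second"; adaptive in reality
RETRY_INTERVAL_MS = 2000   # "about 2 seconds" between later retransmits
CALL_DEADLINE_MS = 50000   # give up after 50 seconds with no response
SEND_FAILED_RETRY_MS = 10  # assumed value for "nearly immediately"


def send_times(send_return_codes=()):
    """Times (ms after the call starts) at which the packet is sent,
    assuming no response ever arrives.

    send_return_codes holds one UDP send return code per attempt;
    attempts past the end of the sequence count as successful (0).
    """
    codes = chain(send_return_codes, repeat(0))
    times = []
    now = 0
    successful_sends = 0
    while now < CALL_DEADLINE_MS:
        times.append(now)
        if next(codes) != 0:
            # The send call failed, so the packet never left the host:
            # retry almost at once instead of waiting for the timer.
            now += SEND_FAILED_RETRY_MS
        elif successful_sends == 0:
            successful_sends += 1
            now += FIRST_TIMEOUT_MS
        else:
            successful_sends += 1
            now += RETRY_INTERVAL_MS
    return times
```

With no send failures this gives sends at 0, 1000, 3000, 5000 ms and so
on; send_times([1]) shows the quick retry after a failed first send. If
a tcpdump trace of your port shows a full timer interval after a failed
send, the optimization is probably not kicking in.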
In my public area (/afs/transarc.com/public/lws/tcpdump) are mods to a
slightly downrev version of tcpdump which will decode most of the AFS
RPCs. You may find watching the traffic between client and server
to be instructive.

Incidentally, VBUSY is not supposed to be passed through to the users.
Since you are working from a source base, look at afs_resource.c:
afs_analyze(). VBUSY is supposed to cause the cache manager to sleep
for 20 seconds and then retry. The 3.4 file server uses this while it
is (re)starting, in addition to the normal use while a volume is being
cloned.
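The intended VBUSY behaviour boils down to a retry loop like the one
below. This is a sketch only, not what afs_analyze() literally does:
the numeric value of VBUSY is an assumption (check your headers), and
call_with_vbusy_retry, its sleep hook and the max_retries cap are
hypothetical names added so the loop can be exercised without waiting.

```python
import time

VBUSY = 110               # assumed numeric value; verify against your headers
VBUSY_SLEEP_SECONDS = 20  # pause before retrying a busy volume


def call_with_vbusy_retry(rpc, sleep=time.sleep, max_retries=None):
    """Call rpc() and retry after a fixed pause while it returns VBUSY.

    rpc is a hypothetical zero-argument callable returning the RPC
    return code. max_retries is an illustrative safety cap; None means
    keep retrying for as long as the server reports VBUSY.
    """
    retries = 0
    while True:
        code = rpc()
        if code != VBUSY:
            return code  # success or some other error: hand it back
        if max_retries is not None and retries >= max_retries:
            return code  # cap reached; only here does VBUSY escape
        sleep(VBUSY_SLEEP_SECONDS)
        retries += 1
```

If users of your port are seeing VBUSY, the first thing to check is
whether the retry path is being reached at all.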
An RPC return code < 0 is interpreted as a networking error. Most
such return codes will cause the server to be marked "down." Using
tcpdump, you will see this in the form of an abort packet, with a -1
argument.
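As a sketch of that rule: the code below treats every negative return
code as a networking error and marks the server down. That is a
simplification (only most such codes have that effect), and marking the
server up on any other code is my assumption, as are the function names
and the status dictionary.

```python
def is_network_error(code):
    """RPC return codes below zero are interpreted as networking errors."""
    return code < 0


def update_server_status(status, server, code):
    """Record "down" or "up" for server in the status dict.

    Simplification: every negative code marks the server down here.
    Assumption: any other code means the server answered, so it is up.
    """
    status[server] = "down" if is_network_error(code) else "up"
    return status[server]
```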