Garance,

The important thing to note is that RPCs wait for threads because every
server thread that can process an RPC is already tied up doing
something.  Those threads could be:

 1 waiting for a reply from the PTDB

 2 waiting for a reply from a TellMeAboutYourself sent to
   a client

 3 waiting for a reply from a ProbeUUID sent to a client

 4 waiting for a reply from an InitCallBackState3 sent to
   a client

 5 waiting for a reply from a Callback sent to a client

 6 waiting for a lock that is held by another thread

 7 waiting for disk I/O to complete

For 1 through 5 there will be an rx connection from the server to the
client for each blocking RPC.  For an RPC issued to a client in response
to an in-bound RPC, you should find a pair of entries in the rxdebug
output with the same address and port number, one being a client
connection and the other a server connection.  For example:

Connection from host X.Y.Z.102, port 34935, Cuid 96817158/47fe3724
  serial 3,  natMTU 1444, security index 0, client conn
    call 0: # 1, state dally, mode: receiving
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Connection from host X.Y.Z.102, port 34935, Cuid b35be5a7/48276030
  serial 3,  natMTU 1444, security index 0, server conn
    call 0: # 1, state active, mode: receiving
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
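
A quick way to spot these paired entries is to group the rxdebug
connection lines by (host, port) and flag any endpoint that shows both a
client conn and a server conn.  A minimal sketch (it parses the output
format shown above; the exact field layout may vary between versions):

```python
import re
from collections import defaultdict

def find_paired_connections(rxdebug_output):
    """Group rxdebug connection entries by (host, port) and return the
    endpoints that have both a client conn and a server conn."""
    conns = defaultdict(set)
    host_port = None
    for line in rxdebug_output.splitlines():
        m = re.match(r"Connection from host (\S+), port (\d+)", line)
        if m:
            host_port = (m.group(1), m.group(2))
        elif host_port and "client conn" in line:
            conns[host_port].add("client")
        elif host_port and "server conn" in line:
            conns[host_port].add("server")
    return [hp for hp, kinds in conns.items() if kinds == {"client", "server"}]

sample = """\
Connection from host X.Y.Z.102, port 34935, Cuid 96817158/47fe3724
  serial 3,  natMTU 1444, security index 0, client conn
Connection from host X.Y.Z.102, port 34935, Cuid b35be5a7/48276030
  serial 3,  natMTU 1444, security index 0, server conn
Connection from host X.Y.Z.50, port 7001, Cuid deadbeef/00000001
  serial 1,  natMTU 1444, security index 0, server conn
"""
print(find_paired_connections(sample))  # [('X.Y.Z.102', '34935')]
```

A pair like the one reported here is the in-bound RPC plus the blocking
out-bound RPC described above; an unpaired server conn is just a client
talking to the server.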

In the case of a client that is switching NAT ports, you may see
connections with the same address but a different port number in each
direction.  In such a case the outbound connection to the old port
number for the ProbeUUID RPC will block for the timeout period.
However, such a thread would not be holding a lock required by other
threads; it would simply remain busy until the timeout fires or a
response is received.

The worst case is when the thread holds a lock on a resource such as a
vnode while callbacks are being issued to one or more clients that are
not responding.  This is typically a data-modifying operation but could
also be a lock state change.  In such a case the thread holding the
vnode lock cannot safely return until all outstanding callbacks have
been issued or scheduled for a delayed break.  While a vnode lock is
held, no other thread that requires that vnode can proceed.  If a large
number of clients are trying to modify the same directory, the
contention ties up a large number of threads because only one can make
progress at a time and each RPC requires a callback to be issued.  If
any of those clients are unresponsive, there are delays.
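
To see why a single unresponsive client hurts so much, here is a
back-of-the-envelope model.  The timeout and latency numbers are
illustrative assumptions, not fileserver defaults: each write holds the
vnode lock while callbacks are broken serially, so a dead client adds
its full timeout to every operation on that vnode.

```python
# Illustrative model: n_clients each issue one write to the same
# directory.  Each write holds the vnode lock while callbacks are
# broken to the other clients; a healthy client answers quickly, a
# dead one waits out the full RPC timeout.  (Both numbers below are
# assumptions chosen for illustration only.)
RESPONSIVE_MS = 5        # assumed round-trip for a healthy client
TIMEOUT_MS = 50_000      # assumed RPC timeout for a dead client

def total_wait_ms(n_clients, n_dead):
    # Time one operation spends breaking callbacks under the lock:
    per_op = (n_clients - 1 - n_dead) * RESPONSIVE_MS + n_dead * TIMEOUT_MS
    # Operations serialize on the vnode lock, so the delays add up:
    return n_clients * per_op

print(total_wait_ms(20, 0))  # 1900 ms with all clients healthy
print(total_wait_ms(20, 1))  # 1001800 ms with a single dead client
```

Under these assumed numbers, one dead client turns roughly two seconds
of aggregate waiting into over sixteen minutes, which is why the thread
pool drains so quickly.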

Unfortunately, the logging in the file server is not very helpful for
identifying which vnodes are involved, because none of the messages
include the FID of the object a callback might be issued for.  A patch
against master to log the FID when a callback fails would be
appreciated.

The behaviors of clients and servers have changed over the years, so it
really does matter which version the file server is running and how it
is configured on the command line.  Your file servers are behind a
firewall and I can't query them, but your DB servers are running 1.4.4,
which is really out of date.  There have been numerous fixes to the host
package that tracks clients between 1.4.4 and 1.4.14, let alone between
1.4 and 1.6, not to mention many improvements in RX processing and fixes
for security vulnerabilities.  You might want to consider an upgrade.

Jeffrey Altman


