On 9/4/2009 5:53 AM, Rainer Toebbicke wrote:
> I believe client-idledeadtime-support-20080430 is a bit unfair, at least
> it introduced the following problem, possibly on a wrong assumption:
> 
> time 502.579567, pid 2195: Analyze RPC op 6 conn 0x4ae0b240 code 0xfffffffd user 0x0
> time 502.579581, pid 2195: afs_Analyze out shouldRetry 0
> time 502.579608, pid 2195: Returning code -3 from 21
> 
> in words: an RXAFS_RemoveFile received an RX_CALL_TIMEOUT (-3), there is
> no alternative server, and hence the operation is not retried. It would
> have been retried in the RX_CALL_DEAD (-1) case (and plenty of other
> cases).
> 
> The comment in the code justifies that special treatment for
> RX_CALL_TIMEOUT on the grounds that the call has timed out while the
> server was still responding to other calls. From the RX code I fail to
> see how this is necessarily correct: rxi_CheckCall can be called from
> the keepalive mechanism and stop the call without any interaction from
> the server, so you can get RX_CALL_TIMEOUT rather than e.g.
> RX_CALL_DEAD even if the server is completely hosed.
> 
> In this particular case the server was stopped; the "hard mount"
> functionality should have ensured that I/Os stay pending until the
> server was restarted.
> 
> Now, I don't know how to solve the yoyo up-down-up problem in case a
> server times out calls selectively. When it happens, for
> single-server-volumes the call should be retried in most cases.
> 
> Would it be reasonable to blacklist only up to the last server and then
> go the usual path which ensures a fair retry?
> 
> (Actually, I wonder whether you can ensure you never get RX_CALL_TIMEOUT
> if a server is simply restarted, which certainly deserves a retry).
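
For readers following the quoted trace, the retry behaviour it describes
can be sketched roughly as below. This is only an illustration under
stated assumptions, not the afs_Analyze code: the two error values are
the ones quoted above (-1 and -3), while code_from_trace, should_retry,
the retry_policy enum and the alternates_left parameter are hypothetical
names invented for this sketch. POLICY_PROPOSED models the suggestion of
blacklisting only up to the last server and then retrying.

```c
#include <assert.h>
#include <stdint.h>

/* Error values as quoted in the message above. */
#define RX_CALL_DEAD    (-1)
#define RX_CALL_TIMEOUT (-3)

/* Hypothetical policy switch, not an OpenAFS name. */
enum retry_policy { POLICY_AS_REPORTED, POLICY_PROPOSED };

/*
 * The trace prints the code as an unsigned 32-bit value (0xfffffffd).
 * Convert it to the signed value without relying on
 * implementation-defined narrowing.
 */
static int32_t code_from_trace(uint32_t raw)
{
    if (raw & 0x80000000u)
        return -(int32_t)(uint32_t)~raw - 1;
    return (int32_t)raw;
}

/*
 * Returns 1 if the operation would be retried, 0 otherwise.
 * alternates_left (hypothetical): servers for the volume that have
 * not yet been tried or blacklisted.
 */
static int should_retry(int32_t code, int alternates_left,
                        enum retry_policy policy)
{
    assert(alternates_left >= 0);

    if (code == RX_CALL_DEAD)
        return 1;               /* retried, per the quoted message */

    if (code == RX_CALL_TIMEOUT) {
        if (alternates_left > 0)
            return 1;           /* skip this server, try another one */
        /* Last server: reported behaviour gives up, proposal retries. */
        return policy == POLICY_PROPOSED;
    }

    return 0;                   /* other codes are not modelled here */
}
```

With these definitions, the traced case (code 0xfffffffd, no alternative
server) is not retried under POLICY_AS_REPORTED and is retried under
POLICY_PROPOSED.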

There was a bug in the idle dead timeout processing in the rx library.
A timeout would occur if the send window was full for longer than the
timeout period.  This is fixed by

  http://gerrit.openafs.org/#change,1183
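
As a rough illustration of the failure mode (not the actual rx code or
the actual patch; see the change above for that), the sketch below
contrasts a check that fires on any sufficiently long wait with one that
only counts waits for the peer's reply. All names here (call_sketch,
wait_reason, both functions) are hypothetical, and the second function
is just one plausible shape for a correction, which may differ from what
the change does.

```c
#include <assert.h>

/* Hypothetical reason a call is blocked; not rx's real state flags. */
enum wait_reason { WAIT_NONE, WAIT_FOR_REPLY, WAIT_FOR_SEND_WINDOW };

struct call_sketch {
    enum wait_reason reason;
    long wait_started;          /* seconds, when the wait began */
};

/*
 * Behaviour as described: any wait longer than idle_dead_time trips
 * the timeout, including a wait caused by a full send window.
 */
static int idle_timed_out_as_described(const struct call_sketch *c,
                                       long now, long idle_dead_time)
{
    assert(idle_dead_time > 0);
    if (c->reason == WAIT_NONE)
        return 0;
    return now - c->wait_started > idle_dead_time;
}

/*
 * Assumed correction: only a wait for the peer's reply counts toward
 * the idle dead timeout.
 */
static int idle_timed_out_reply_only(const struct call_sketch *c,
                                     long now, long idle_dead_time)
{
    assert(idle_dead_time > 0);
    return c->reason == WAIT_FOR_REPLY
        && now - c->wait_started > idle_dead_time;
}
```

A call blocked on the send window for 100 seconds with a 60 second
timeout trips the first check but not the second.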

Jeffrey Altman
