On 9/4/2009 5:53 AM, Rainer Toebbicke wrote:
> I believe client-idledeadtime-support-20080430 is a bit unfair, at least
> it introduced the following problem, possibly on a wrong assumption:
>
> time 502.579567, pid 2195: Analyze RPC op 6 conn 0x4ae0b240 code
> 0xfffffffd user 0x0
> time 502.579581, pid 2195: afs_Analyze out shouldRetry 0
> time 502.579608, pid 2195: Returning code -3 from 21
>
> in words: an RXAFS_RemoveFile received an RX_CALL_TIMEOUT (-3), there is
> no alternative server, and hence the operation is not retried. It would
> have been retried in the RX_CALL_DEAD (-1) case (and plenty of other
> cases).
>
> The comment in the code justifies that special treatment for
> RX_CALL_TIMEOUT on the grounds that the call has timed out while server
> was still responding to other calls. From the RX code I fail to see how
> this is necessarily correct: rxi_CheckCall can be called from the
> keepalive mechanism and stop the call without any interaction from the
> server, you can get RX_CALL_TIMEOUT and not e.g. RX_CALL_DEAD even if
> the server is completely hosed.
>
> In this particular case the server was stopped, the "hard mount"
> functionality should have ensured that I/Os stay pending until the
> server was restarted.
>
> Now, I don't know how to solve the yoyo up-down-up problem in case a
> server times out calls selectively. When it happens, for
> single-server-volumes the call should be retried in most cases.
>
> Would it be reasonable to blacklist only up to the last server and then
> go the usual path which ensures a fair retry?
>
> (Actually, I wonder whether you can ensure you never get RX_CALL_TIMEOUT
> if a server is simply restarted, which certainly deserves a retry).
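
For reference, the retry behaviour described in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual afs_Analyze code: the function should_retry, its parameters servers_left and fair_retry_on_last, and the simplified handling of other error codes are all hypothetical. Only the two error code values are taken from the trace above.

```c
#include <stdbool.h>

/* Error code values as they appear in the quoted message. */
#define RX_CALL_DEAD    (-1)
#define RX_CALL_TIMEOUT (-3)

/*
 * Hypothetical helper (not the real afs_Analyze): decide whether a failed
 * call should be retried.
 *
 *   code               - rx error code returned by the call
 *   servers_left       - number of alternative servers not yet tried
 *   fair_retry_on_last - false models the reported behaviour; true models
 *                        the proposal to fall back to the usual retry path
 *                        once the last server has been reached
 */
static bool should_retry(int code, int servers_left, bool fair_retry_on_last)
{
    if (code == RX_CALL_DEAD)
        return true;                /* server presumed down: keep retrying */

    if (code == RX_CALL_TIMEOUT) {
        if (servers_left > 0)
            return true;            /* skip this server, try another one */
        return fair_retry_on_last;  /* last server: the case under discussion */
    }

    return false;                   /* other codes are not modelled here */
}
```

With fair_retry_on_last set to false, a single-server volume that returns RX_CALL_TIMEOUT is not retried, which matches the "shouldRetry 0" line in the trace. Setting it to true corresponds to the suggestion of blacklisting only up to the last server.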

There was a bug in the idle dead timeout processing in the rx library.
A timeout would occur if the send window was full for longer than the
timeout period.

This is fixed by

  http://gerrit.openafs.org/#change,1183

Jeffrey Altman
