I managed to reproduce this here, and after much instrumenting and
tracing, I think the problem is that the "areq" structure is not
always properly initialized in afs_Conn() when we test, among other
things, for skipserver[0] == 1. In fact, about 20% of the time,
skipserver and other fields (idleError, tokenError) look completely
bogus.

According to comments in the code, areq->initd should be 1 for most of the fields in that structure to be meaningful; traces show that when initd is 1, the fields are indeed properly initialized.

The attached patch checks for initd == 1 before relying on skipserver, which fixes the problem for me. Could you guys give it a try?
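
To make the idea concrete, here is a minimal standalone sketch of the guarded test. It is not the OpenAFS code: the struct and the names sketch_vrequest, sketch_should_skip and SKETCH_MAXHOSTS are hypothetical stand-ins, and only the fields mentioned above are modeled.

```c
#include <string.h>

/* Hypothetical array size, chosen only for this sketch. */
#define SKETCH_MAXHOSTS 8

/* Hypothetical stand-in for the request structure; it models only the
 * fields discussed in this message. */
struct sketch_vrequest {
    int initd;                          /* 1 once the fields below are valid */
    int idleError;
    int tokenError;
    char skipserver[SKETCH_MAXHOSTS];
};

/* Returns 1 if server i should be skipped, 0 otherwise.
 * The error counters and skipserver[] are consulted only when initd == 1,
 * so garbage in a request that was never initialized cannot cause a skip. */
static int
sketch_should_skip(const struct sketch_vrequest *areq, int i)
{
    return (areq->initd == 1)
        && ((areq->idleError > 0) || (areq->tokenError > 0))
        && (areq->skipserver[i] == 1);
}
```

Without the initd test, a request whose memory happens to hold nonzero junk in idleError and a 1 in skipserver[0] would wrongly rule out the first server.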

Marc

Index: src/afs/afs_conn.c
===================================================================
RCS file: /cvs/openafs/src/afs/afs_conn.c,v
retrieving revision 1.13.2.3
diff -u -r1.13.2.3 afs_conn.c
--- src/afs/afs_conn.c  29 Jun 2008 03:26:04 -0000      1.13.2.3
+++ src/afs/afs_conn.c  19 Oct 2008 23:05:47 -0000
@@ -84,7 +84,7 @@
     /* First is always lowest rank, if it's up */
     if ((tv->status[0] == not_busy) && tv->serverHost[0]
        && !(tv->serverHost[0]->addr->sa_flags & SRVR_ISDOWN) &&
-       !(((areq->idleError > 0) || (areq->tokenError > 0))
+       !((areq->initd == 1) && ((areq->idleError > 0) || (areq->tokenError > 0))
          && (areq->skipserver[0] == 1)))
        lowp = tv->serverHost[0]->addr;
 
