So I managed to reproduce this here, and after much instrumenting and
tracing, I think the problem is that the "areq" structure is not
always properly initialized in afs_Conn() when we test, among other
things, for skipserver == 1. In fact, about 20% of the time, skipserver
and other fields (idleError, tokenError) look completely bogus.
According to comments in the code, areq->initd should be 1 for most of
the fields in that structure to be meaningful; traces show that when
initd is 1, the fields are indeed properly initialized.
The attached patch checks for initd == 1 before relying on skipserver.
It fixes the problem for me. Could you guys give it a try?
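
To make the idea concrete, here is a minimal standalone sketch of the guard,
not the actual OpenAFS code. The struct "req_sketch", the constant
SKETCH_MAXHOSTS and the helper should_skip_server() are hypothetical names
made up for illustration; only the field names (initd, idleError,
tokenError, skipserver) come from the real structure, and their types here
are assumptions.

```c
#include <string.h>

/* Hypothetical array size, chosen only for this sketch. */
#define SKETCH_MAXHOSTS 13

/* Simplified stand-in for the request structure; only the fields
 * discussed in this message are modelled, with assumed types. */
struct req_sketch {
    int initd;                        /* 1 once the fields below are valid */
    int idleError;
    int tokenError;
    char skipserver[SKETCH_MAXHOSTS];
};

/* Returns 1 if server slot i should be skipped, 0 otherwise.
 * The error fields and skipserver[] are only consulted when initd == 1,
 * so garbage in an uninitialized structure cannot cause a skip. */
static int
should_skip_server(const struct req_sketch *areq, int i)
{
    if (areq->initd != 1)
        return 0;
    return ((areq->idleError > 0) || (areq->tokenError > 0))
        && (areq->skipserver[i] == 1);
}
```

With this shape, a structure whose initd is not 1 never causes a server to
be skipped, whatever happens to be in the other fields.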
Marc
Index: src/afs/afs_conn.c
===================================================================
RCS file: /cvs/openafs/src/afs/afs_conn.c,v
retrieving revision 1.13.2.3
diff -u -r1.13.2.3 afs_conn.c
--- src/afs/afs_conn.c 29 Jun 2008 03:26:04 -0000 1.13.2.3
+++ src/afs/afs_conn.c 19 Oct 2008 23:05:47 -0000
@@ -84,7 +84,7 @@
/* First is always lowest rank, if it's up */
if ((tv->status[0] == not_busy) && tv->serverHost[0]
&& !(tv->serverHost[0]->addr->sa_flags & SRVR_ISDOWN) &&
- !(((areq->idleError > 0) || (areq->tokenError > 0))
+ !((areq->initd == 1) && ((areq->idleError > 0) || (areq->tokenError > 0))
&& (areq->skipserver[0] == 1)))
lowp = tv->serverHost[0]->addr;