Hello,

We're running a cluster of OpenAFS machines: two servers (more coming soon)
and, at peak, up to 500 read-heavy clients. Occasionally (roughly once in
every 50,000 or more access attempts) a client will temporarily receive the
following error:

(from client syslog)
Feb 24 08:37:17 ip-10-90-189-162 kernel: [6181788.182444] afs: Lost contact
with file server IPADDR in cell CELL (all multi-homed ip addresses down for
the server)
Feb 24 08:37:33 ip-10-90-189-162 kernel: [6181805.056860] afs: file server
IPADDR in cell CELL is back up (multi-homed address; other same-host
interfaces may still be down)

During that 16-second span, only that client is unable to access AFS.
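
In case it helps anyone reproduce the measurement, here is a rough sketch of
how the outage length could be pulled out of a client syslog by pairing each
"Lost contact" message with the next "is back up" message. It assumes each
syslog entry is on a single unwrapped line (unlike the wrapped copy above).
The function name and the year argument are my own placeholders: syslog
timestamps carry no year, so one has to be supplied.

```python
import re
from datetime import datetime

# Matches the leading syslog timestamp, e.g. "Feb 24 08:37:17".
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})")


def outage_spans(lines, year=2012):
    """Return (start_time, seconds) for each lost-contact/back-up pair.

    `year` is an assumed value, since syslog lines do not record one.
    """
    spans = []
    down_at = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # not a syslog entry (or a wrapped continuation line)
        when = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if "afs: Lost contact" in line:
            down_at = when
        elif "is back up" in line and down_at is not None:
            spans.append((down_at, (when - down_at).total_seconds()))
            down_at = None
    return spans
```

Running this over the two entries above (joined back into single lines)
should report one span matching the gap between the two timestamps.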

I don't see any messages in the OpenAFS server logs with matching timestamps.

Currently, the servers are running 1.4.14 (will be upgraded to 1.6 soon) on
Ubuntu 10.04. The clients are running 1.6.0 on Ubuntu 11.10. The clients
are not human users, but processes that are constantly pulling data from
AFS.

What tools do I have at my disposal to debug this issue? What is the
recommended approach to take?

Off-topic question: If a volume has N read replicas, how do clients choose
which one to use?

Best,
Ken
