Hello,

We're running a cluster of OpenAFS machines: 2 servers (more coming soon), and often up to 500 read-heavy clients. Occasionally (around once every 50,000+ access attempts) a client will temporarily receive the following error:
(from client syslog)

Feb 24 08:37:17 ip-10-90-189-162 kernel: [6181788.182444] afs: Lost contact with file server IPADDR in cell CELL (all multi-homed ip addresses down for the server)
Feb 24 08:37:33 ip-10-90-189-162 kernel: [6181805.056860] afs: file server IPADDR in cell CELL is back up (multi-homed address; other same-host interfaces may still be down)

During that 16-second span, that client alone cannot access AFS. I don't see any messages with matching timestamps in the OpenAFS server logs.

Currently, the servers are running 1.4.14 (to be upgraded to 1.6 soon) on Ubuntu 10.04. The clients are running 1.6.0 on Ubuntu 11.10. The clients are not human users but processes that are constantly pulling data from AFS.

What tools do I have at my disposal to debug this issue? What is the recommended approach?

Separate question: if a volume has N read-only replicas, how do clients choose which one to use?

Best,
Ken
