Hello, I have recovered from this situation already, but I was curious to hear if others have tested or experienced this issue as well:
If the AFS database server with the lowest IP address goes down or is offline, but two other database servers are still available, are the clients and the remaining servers supposed to handle that situation gracefully? We had an incident where this happened (in our case, the database server went offline because the switch died). Afterward, AFS access (simply "ls /afs/<cellname>/") and vos commands appeared to be unresponsive.

AFS database servers:
- lowest IP address pt/vl server: CentOS 5.11 32-bit, OpenAFS 1.6.11
- secondary and tertiary pt/vl servers: CentOS 6.6 64-bit, OpenAFS 1.6.11

Clients (at least these were affected, among others I am sure):
- CentOS 6.6 64-bit, OpenAFS 1.6.1
- CentOS 6.7 64-bit, OpenAFS 1.6.9

The clients are all configured with -dynroot and -fakestat-all, and they have identical CellServDB files listing our database servers in order from lowest to highest IP address.

I apologize that I do not have much in the way of debugging output... I didn't think to run rxdebug on the client or to trace the "ls" process. We were in "emergency mode" trying to get the switch replaced to bring services back online, but I was still surprised that AFS had this trouble. I will try to replicate the issue in a test cell in the near future...

So I am mainly wondering whether this is expected (that is, whether OpenAFS depends on having its lowest-IP-address server online at all times) or whether we likely have a configuration issue in our cell. I set up our cell about 5 years ago as a complete newbie to OpenAFS, and while I've gained a lot of insight and experience since, I still don't understand all the nuances.

Thank you!

-- 
Jonathan Leung-Nilsson
Social Sciences Computing Services
University of California, Irvine
