On Thu, Aug 27, 2015 at 5:06 PM, Benjamin Kaduk <[email protected]> wrote:
> On Thu, 27 Aug 2015, Jonathan Leung-Nilsson wrote:
>
> > So I am mainly wondering if this is expected - if OpenAFS depends on having
> > its lowest IP address server online all the time - or if it's likely that
> > we have a configuration issue in our cell.
>
> The short answer is that clients are expected to continue functioning even
> if the lowest-IP db server is offline, the remaining N-1 are supposed to
> elect a new coordinator and read-write access resume within a couple
> election cycles;

Thank you for confirming. This is what I thought, just wanted to check that
I wasn't crazy. Our 2 remaining DB servers did select a new coordinator
among themselves, so that part worked.

> clients might experience full hangs or just inability to
> make database changes for a couple minutes as things recover.

"a couple minutes" would be bad enough, since we have websites using AFS as
their DocumentRoot, but in our case it took a little over an hour until the
incident was resolved (we replaced the network switch that the AFS db server
was behind) and clients appeared unresponsive the entire time.

> The long answer requires more research and discussion of edge cases such
> as network partitions, timeouts, and such, which I am not prepared to
> perform right now.

Yeah... that means this issue is very specific to our setup and the failure
situation. I'll see if I have time to try to replicate it and figure out why
the clients were unresponsive. Most likely we will find alternative ways to
mitigate the impact of this kind of failure.

Best,
Jonathan
