On Thu, Aug 27, 2015 at 5:06 PM, Benjamin Kaduk <[email protected]> wrote:

>
> On Thu, 27 Aug 2015, Jonathan Leung-Nilsson wrote:
>
> > So I am mainly wondering if this is expected - if OpenAFS depends on having
> > its lowest IP address server online all the time - or if it's likely that
> > we have a configuration issue in our cell.
>
> The short answer is that clients are expected to continue functioning even
> if the lowest-IP db server is offline, the remaining N-1 are supposed to
> elect a new coordinator and read-write access resume within a couple
> election cycles;


Thank you for confirming. This is what I thought; I just wanted to check that
I wasn't crazy. Our 2 remaining DB servers did elect a new coordinator
among themselves, so that part worked.

> clients might experience full hangs or just inability to
> make database changes for a couple minutes as things recover.
>

"A couple minutes" would be bad enough, since we have websites using AFS as
their DocumentRoot, but in our case it took a little over an hour until the
incident was resolved (we replaced the network switch that the AFS db
server was behind), and clients appeared unresponsive the entire time.

> The long answer requires more research and discussion of edge cases such
> as network partitions, timeouts, and such, which I am not prepared to
> perform right now.


Yeah... that means this issue is very specific to our setup and the failure
situation. I'll see if I have time to try to replicate it and figure out
why the clients were unresponsive. Most likely we will find alternative
ways to mitigate the impact of this kind of failure.

Best,
Jonathan
