For Accumulo, like you said, information is published in ZK for clients
to find. Thinking about just the Accumulo master process (tabletservers
follow the same principle but in a slightly different way), clients will
cache that location from ZK and then on some RPC transport failure (e.g.
ConnectException -- Thrift exception in Accumulo's case), the client
will invalidate that cache, refresh the location from ZK and try again.
The exit is either the client gets a connection to the master after
retrying enough, or the code just gives up. I think we tend to keep
spinning in Accumulo (perhaps more than we should) which hides
"expected" failures from clients completely (clients don't have to be
aware that they're talking to a new master than they were before), but
it can make transport issues harder to diagnose.
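That cache-invalidate-refresh-retry loop can be sketched roughly as below. This is not Accumulo's actual client code; LocationSource and Rpc are hypothetical stand-ins for the ZK lookup and the Thrift call, and the bounded retry count replaces Accumulo's tendency to keep spinning:

```java
import java.io.IOException;
import java.net.ConnectException;

// Sketch of a client that caches the master location and, on a transport
// failure, invalidates the cache, refreshes from ZK, and retries.
class MasterLocator {
    interface LocationSource { String fetchLocation(); }          // stand-in for the ZK read
    interface Rpc<T> { T call(String location) throws IOException; } // stand-in for the Thrift call

    private final LocationSource source;
    private final int maxAttempts;
    private String cached;                                        // cached master location

    MasterLocator(LocationSource source, int maxAttempts) {
        this.source = source;
        this.maxAttempts = maxAttempts;
    }

    <T> T invoke(Rpc<T> rpc) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (cached == null) {
                cached = source.fetchLocation();                  // refresh location from ZK
            }
            try {
                return rpc.call(cached);                          // success: exit the loop
            } catch (ConnectException e) {
                cached = null;                                    // invalidate the stale cache
                last = e;
            }
        }
        throw last != null ? last : new IOException("no attempts made"); // gave up
    }
}
```

The exit conditions match the description above: either the RPC eventually succeeds against a freshly looked-up location, or the loop exhausts its attempts and surfaces the last transport failure.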
Steve Loughran wrote:
Ted & Billie,
I'm updating the YARN-2683 registry documentation, including some guidelines on
how to handle failure to communicate with a remote server.
This is a problem the Slider REST client has itself; currently the REST
client fails immediately, and if a new attempt is made to look up the
AM service API URL it will pick up a new value.
The API client is therefore not handling the rebinding itself. Which is a bit
of a get-out; I'm relying on the AM not failing very often. Provided
the clients don't hold onto their client API objects for long, they "should" be
OK.
What do HBase and Accumulo do here? I know they publish their binding info via
ZK, but how do they fail over?