[
https://issues.apache.org/jira/browse/HADOOP-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086517#comment-13086517
]
Aaron T. Myers commented on HADOOP-7380:
----------------------------------------
As requested, here's a design doc:
h2. Client Retry/Fail Over in IPC Overview
The goal of HDFS-1973 is to provide a facility for clients to fail over and
retry an RPC to a different NN in the event of failure, in support of HDFS HA.
Since the HDFS RPC mechanisms are built on top of the Common IPC library, this
is a natural place to implement this functionality. Furthermore, since client
fail over in general has challenges which are not HDFS-specific, we may be able
to reuse this mechanism for other HA services in the future (e.g. the
MRv2 Resource Manager).
h3. Goals
This mechanism should be able to support the following:
# Support for multiple distinct ways of determining which object an RPC
should be attempted against.
# A way of specifying customized failover strategies. For example, some HA
service may only support a pair of machines to serve requests, while another
might support an arbitrary number. The strategy for failing over to a new
machine might be different for these cases.
# The method for specifying a retry/failover strategy should be able to control
both retry and failover logic. For example, a strategy may want to try a
request to server A once, attempt a failover to server B immediately, fail back
to server A and then retry several times with some amount of backoff.
# A way for a remote process to indicate that it is not the appropriate process
to serve the request, and that an attempt should be made to another process.
# A way of specifying that some operations in an IPC interface are not safe to
be failed over and retried.
h3. Approach
The Common IPC library already supports retrying an IPC against the same remote
process. This is done via the classes in the o.a.h.io.retry package. The exact
desired retry semantics can be specified using an implementation of the
{{RetryPolicy}} interface. An implementation of this interface is responsible
for determining whether or not a particular method call should be retried given
the number of times it has already been tried and the particular exception which
caused the method to fail. An implementation of this interface can choose either
to have the failed method retried or to have the exception re-thrown.
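As a rough illustration of the boolean contract described above, here is a minimal sketch. {{SimpleRetryPolicy}}, {{BoundedRetryPolicy}} and {{RetryLoop}} are hypothetical names invented for this example; they are simplified stand-ins and not the actual classes in o.a.h.io.retry.

```java
import java.util.concurrent.Callable;

/** Hypothetical, simplified stand-in for a boolean-returning retry policy. */
interface SimpleRetryPolicy {
    /**
     * @param e       the exception that caused the call to fail
     * @param retries how many times the call has already been retried
     * @return true to retry against the same object, false to give up
     */
    boolean shouldRetry(Exception e, int retries);
}

/** Retries any failure up to a fixed number of times (illustrative only). */
final class BoundedRetryPolicy implements SimpleRetryPolicy {
    private final int maxRetries;

    BoundedRetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    @Override
    public boolean shouldRetry(Exception e, int retries) {
        return retries < maxRetries;
    }
}

/** Hypothetical driver playing the role of the invocation handler. */
final class RetryLoop {
    static <T> T invoke(Callable<T> call, SimpleRetryPolicy policy) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!policy.shouldRetry(e, retries)) {
                    throw e; // the policy gave up: re-throw the original exception
                }
                retries++;
            }
        }
    }
}
```

With {{new BoundedRetryPolicy(2)}}, a call that fails twice and then succeeds is attempted three times in total, always against the same object.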
In a sense, client-side failover is exactly the same as the existing retry
mechanism, except method calls can be retried against a different proxy object.
Thus, this JIRA proposes to extend the existing retry facility to also support
client failover.
Presently, the {{RetryPolicy.shouldRetry}} method only returns a boolean to
indicate whether a method invocation should be retried or considered to have
failed. This can be augmented to return an enum value to indicate whether a
particular method invocation should fail, be retried against the same object,
or retried on another object. In order to determine what object a method should
be tried against, this JIRA introduces the concept of a
{{FailoverProxyProvider}}. An implementation of this interface supplies the
proxy object to invoke and performs a client failover whenever
{{RetryPolicy.shouldRetry}} indicates that one is needed. The decision comes
from the particular {{RetryPolicy}} the {{RetryInvocationHandler}} is configured to use.
This addresses goals 1, 2, and 3 from above.
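The interplay between a three-valued retry decision and a proxy provider might look roughly like the sketch below. Every name here ({{RetryDecision}}, {{FailoverAwarePolicy}}, {{ProxyProvider}}, {{RoundRobinProvider}}, {{FailoverThenRetryPolicy}}, {{FailoverLoop}}) is an assumption made for illustration and is not taken from the patch; backoff between attempts is omitted for brevity. The example policy loosely follows the scenario in goal 3: fail over a bounded number of times first, then retry in place.

```java
import java.util.List;

/** Hypothetical three-valued result replacing the boolean. */
enum RetryDecision { FAIL, RETRY, FAILOVER_AND_RETRY }

/** Hypothetical policy that sees both the retry and the failover counts. */
interface FailoverAwarePolicy {
    RetryDecision shouldRetry(Exception e, int retries, int failovers);
}

/** Hypothetical provider: hands out the current proxy and can switch to another. */
interface ProxyProvider<T> {
    T getProxy();
    void performFailover(T currentProxy);
}

/** Cycles through a fixed list of proxies (works for a pair or any number). */
final class RoundRobinProvider<T> implements ProxyProvider<T> {
    private final List<T> proxies;
    private int current = 0;

    RoundRobinProvider(List<T> proxies) {
        this.proxies = proxies;
    }

    @Override
    public T getProxy() {
        return proxies.get(current);
    }

    @Override
    public void performFailover(T currentProxy) {
        current = (current + 1) % proxies.size();
    }
}

/** Fails over up to maxFailovers times, then retries in place up to maxRetries times. */
final class FailoverThenRetryPolicy implements FailoverAwarePolicy {
    private final int maxFailovers;
    private final int maxRetries;

    FailoverThenRetryPolicy(int maxFailovers, int maxRetries) {
        this.maxFailovers = maxFailovers;
        this.maxRetries = maxRetries;
    }

    @Override
    public RetryDecision shouldRetry(Exception e, int retries, int failovers) {
        if (failovers < maxFailovers) {
            return RetryDecision.FAILOVER_AND_RETRY;
        }
        return retries < maxRetries ? RetryDecision.RETRY : RetryDecision.FAIL;
    }
}

/** A remote call made through a proxy of type T. */
interface Rpc<T, R> {
    R call(T proxy) throws Exception;
}

/** Hypothetical driver playing the role of the invocation handler. */
final class FailoverLoop {
    static <T, R> R invoke(ProxyProvider<T> provider, FailoverAwarePolicy policy,
                           Rpc<T, R> rpc) throws Exception {
        int retries = 0;
        int failovers = 0;
        while (true) {
            T proxy = provider.getProxy();
            try {
                return rpc.call(proxy);
            } catch (Exception e) {
                switch (policy.shouldRetry(e, retries, failovers)) {
                    case FAIL:
                        throw e;
                    case FAILOVER_AND_RETRY:
                        provider.performFailover(proxy); // next attempt uses another proxy
                        failovers++;
                        break;
                    case RETRY:
                        retries++; // next attempt reuses the same proxy
                        break;
                }
            }
        }
    }
}
```

With two proxies A and B, {{new FailoverThenRetryPolicy(2, 2)}} and a call that always fails, the attempts go to A, B, A, A, A before the exception is re-thrown. Keeping the proxy selection in the provider and the decision in the policy is what lets the two vary independently.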
To address goal 4, this JIRA introduces a new exception type,
{{StandbyException}}, which a remote process can throw to indicate that it is
not the appropriate process to handle requests for a given service at this
time. {{RetryPolicy}} implementations may choose to handle this exception
differently than other exception types when determining whether or not to retry
or fail over an operation.
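A small sketch of how this could fit together follows. It assumes the exception is modelled as an {{IOException}} subclass; {{NameService}} and {{StandbyAwarePolicy}} are hypothetical names used only for illustration, and the {{RetryDecision}} enum is repeated so the example compiles on its own.

```java
import java.io.IOException;

/** Hypothetical three-valued retry decision, repeated here for self-containment. */
enum RetryDecision { FAIL, RETRY, FAILOVER_AND_RETRY }

/** Assumption: modelled as an IOException subclass for this sketch. */
class StandbyException extends IOException {
    StandbyException(String message) {
        super(message);
    }
}

/** Hypothetical server-side service that is either active or standby. */
final class NameService {
    private final boolean active;

    NameService(boolean active) {
        this.active = active;
    }

    String getFileInfo(String path) throws StandbyException {
        if (!active) {
            // Tell the caller that another process should serve this request.
            throw new StandbyException("not the active process for this service");
        }
        return "info:" + path;
    }
}

/** Hypothetical policy that treats StandbyException differently from other failures. */
final class StandbyAwarePolicy {
    private final int maxFailovers;

    StandbyAwarePolicy(int maxFailovers) {
        this.maxFailovers = maxFailovers;
    }

    RetryDecision shouldRetry(Exception e, int retries, int failovers) {
        if (e instanceof StandbyException && failovers < maxFailovers) {
            return RetryDecision.FAILOVER_AND_RETRY;
        }
        return RetryDecision.FAIL; // this sketch does not retry any other failure
    }
}
```

Here a standby process rejects the call, and the policy turns that rejection into a failover rather than a plain failure, up to a configured limit.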
Though there may be circumstances in which a client desires a more complex
retry/failover strategy, most clients will want to fail over only on network
exceptions in which an RPC is guaranteed not to have reached the remote
process, or in cases in which the particular method to be retried has no
harmful side effects if retried (e.g. read operations or idempotent write
operations). This JIRA thus introduces a new {{RetryPolicy}} implementation,
{{FailoverOnNetworkExceptionRetry}}. This retry policy fails over whenever a
method call fails in such a way as to guarantee that it did not reach the
original remote process, or when the failed method can be safely retried.
In order to know which methods of an IPC interface can be safely retried, this
JIRA introduces an {{@Idempotent}} method annotation. The
{{RetryInvocationHandler}} passes the presence of this annotation on to the
retry policy when determining whether or not to retry a method. This
addresses goal 5.
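To make the annotation-driven decision concrete, here is a minimal sketch. The annotation definition, {{ClientProtocolSketch}} and {{NetworkFailoverPolicy}} are illustrative assumptions and are not the code in the attached patches. In particular, treating {{java.net.ConnectException}} as proof that the call never reached the remote process is an assumption of this example.

```java
import java.io.IOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.net.ConnectException;

/** Hypothetical three-valued retry decision, repeated here for self-containment. */
enum RetryDecision { FAIL, RETRY, FAILOVER_AND_RETRY }

/** Sketch of a marker annotation; runtime retention lets a handler read it reflectively. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Idempotent { }

/** Hypothetical IPC interface with one safe and one unsafe operation. */
interface ClientProtocolSketch {
    @Idempotent
    String getFileInfo(String path) throws IOException;

    // Not annotated: appending twice would duplicate data.
    void append(String path, byte[] data) throws IOException;
}

/** Hypothetical policy in the spirit of failing over on network exceptions. */
final class NetworkFailoverPolicy {
    private final int maxFailovers;

    NetworkFailoverPolicy(int maxFailovers) {
        this.maxFailovers = maxFailovers;
    }

    /** What an invocation handler could compute before consulting the policy. */
    static boolean isIdempotent(Method method) {
        return method.isAnnotationPresent(Idempotent.class);
    }

    RetryDecision shouldRetry(Exception e, int failovers, boolean isMethodIdempotent) {
        if (failovers >= maxFailovers) {
            return RetryDecision.FAIL;
        }
        if (e instanceof ConnectException) {
            // Assumption: the connection was never established, so the call was not sent.
            return RetryDecision.FAILOVER_AND_RETRY;
        }
        if (e instanceof IOException && isMethodIdempotent) {
            // The call may have been executed, but repeating it is harmless.
            return RetryDecision.FAILOVER_AND_RETRY;
        }
        return RetryDecision.FAIL;
    }
}
```

Under these assumptions, a generic {{IOException}} during {{append}} fails outright, while the same exception during {{getFileInfo}} triggers a failover.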
> Add client failover functionality to o.a.h.io.(ipc|retry)
> ---------------------------------------------------------
>
> Key: HADOOP-7380
> URL: https://issues.apache.org/jira/browse/HADOOP-7380
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: ipc
> Affects Versions: 0.23.0
> Reporter: Aaron T. Myers
> Assignee: Aaron T. Myers
> Fix For: 0.23.0
>
> Attachments: hadoop-7380-hdfs-example.patch, hadoop-7380.0.patch,
> hadoop-7380.1.patch, hadoop-7380.2.patch, hdfs-7380.3.patch
>
>
> Implementing client failover will likely require changes to {{o.a.h.io.ipc}}
> and/or {{o.a.h.io.retry}}. This JIRA is to track those changes.