[ https://issues.apache.org/jira/browse/HADOOP-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086517#comment-13086517 ]

Aaron T. Myers commented on HADOOP-7380:
----------------------------------------

As requested, here's a design doc:

h2. Client Retry/Fail Over in IPC Overview

The goal of HDFS-1973 is to provide a facility for clients to fail over and 
retry an RPC to a different NN in the event of failure, in support of HDFS HA. 
Since the HDFS RPC mechanisms are built on top of the Common IPC library, this 
is a natural place to implement this functionality. Furthermore, since client 
failover in general poses challenges which are not HDFS-specific, we may be able 
to reuse this mechanism for other HA services in the future (e.g. perhaps the 
MRv2 Resource Manager).

h3. Goals

This mechanism should be able to support the following:

# A way to support multiple distinct strategies for determining which object an 
RPC should be attempted against.
# A way of specifying customized failover strategies. For example, some HA 
service may only support a pair of machines to serve requests, while another 
might support an arbitrary number. The strategy for failing over to a new 
machine might be different for these cases.
# The method for specifying a retry/failover strategy should be able to control 
both retry and failover logic. For example, a strategy may want to try a 
request to server A once, attempt a failover to server B immediately, fail back 
to server A and then retry several times with some amount of backoff.
# A way for a remote process to indicate that it is not the appropriate process 
to serve the request, and that an attempt should be made to another process.
# A way of specifying that some operations in an IPC interface are not safe to 
be failed over and retried.

h3. Approach

The Common IPC library already supports retrying an IPC against the same remote 
process. This is done via the classes in the o.a.h.io.retry package. The exact 
desired retry semantics can be specified using an implementation of the 
{{RetryPolicy}} interface. An implementation of this interface is responsible 
for determining whether or not a particular method call should be retried given 
the number of times it has already been tried and the particular exception which 
caused the method to fail. An implementer of this interface can either choose 
to have the failed method retried, or the exception re-thrown.
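
For illustration, a custom policy against the existing interface might look 
roughly like this (just a sketch; the boolean-returning {{shouldRetry}} form is 
the one described above):

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.retry.RetryPolicy;

// Sketch only: retries any IOException up to a fixed number of times and
// rethrows everything else, using the boolean-returning shouldRetry form
// described above.
public class RetryNTimesOnIOException implements RetryPolicy {
  private final int maxRetries;

  public RetryNTimesOnIOException(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  @Override
  public boolean shouldRetry(Exception e, int retries) throws Exception {
    if (e instanceof IOException && retries < maxRetries) {
      return true; // have the RetryInvocationHandler re-invoke the method
    }
    throw e;       // give up: surface the original exception to the caller
  }
}
{code}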

In a sense, client-side failover is exactly the same as the existing retry 
mechanism, except method calls can be retried against a different proxy object. 
Thus, this JIRA proposes to extend the existing retry facility to also support 
client failover.

Presently, the {{RetryPolicy.shouldRetry}} method only returns a boolean to 
indicate whether a method invocation should be retried or considered to have 
failed. This can be augmented to return an enum value to indicate whether a 
particular method invocation should fail, be retried against the same object, 
or retried on another object. In order to determine what object a method should 
be tried against, this JIRA introduces the concept of a 
{{FailoverProxyProvider}}. An implementation of this interface is responsible 
for supplying the current proxy object and for initiating a client failover 
whenever the {{RetryPolicy}} that the {{RetryInvocationHandler}} is configured 
to use indicates via {{RetryPolicy.shouldRetry}} that a failover should be 
performed. 
This addresses goals 1, 2, and 3 from above.
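
As a rough sketch (names here are illustrative; the attached patch is 
authoritative for the actual signatures), the augmented contract might look 
something like this:

{code:java}
// Sketch of the proposed FailoverProxyProvider; the attached patch is
// authoritative for the actual names and signatures.
public interface FailoverProxyProvider {
  // The proxy object that method invocations should currently be sent to.
  Object getProxy();

  // Initiate a client failover. After this returns, getProxy() should hand
  // back a proxy for the new target process.
  void performFailover(Object currentProxy);
}

// Possible results of the augmented RetryPolicy.shouldRetry, replacing the
// current boolean return value.
enum RetryAction {
  FAIL,               // give up and rethrow the exception to the caller
  RETRY,              // re-invoke the method against the same proxy
  FAILOVER_AND_RETRY  // ask the FailoverProxyProvider to fail over, then retry
}
{code}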

To address goal 4, this JIRA introduces a new exception type - 
{{StandbyException}} - which a remote process can throw to indicate that it is 
not the appropriate process to handle requests for a given service at 
this time. {{RetryPolicy}} implementations may choose to handle this exception 
differently than other exception types when determining whether or not to retry 
or fail over an operation.
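
For example, a standby process might reject requests roughly like this (the 
superclass, constructor, and surrounding service code are hypothetical; only 
the exception's role comes from the proposal above):

{code:java}
import java.io.IOException;

// Hypothetical sketch: the superclass and constructor of StandbyException
// shown here are assumptions, as is the surrounding service code.
class StandbyException extends IOException {
  StandbyException(String msg) { super(msg); }
}

class ExampleService {
  private volatile boolean active; // hypothetical HA state flag

  long getServiceUptime() throws IOException {
    if (!active) {
      // Tell the client's RetryPolicy that this process is not currently
      // serving requests, so it may elect to fail over rather than retry here.
      throw new StandbyException("Operation not permitted in standby state");
    }
    return System.currentTimeMillis();
  }
}
{code}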

Though there may be circumstances in which a client may desire a more complex 
retry/failover strategy, most clients will want to fail over only on network 
exceptions in which an RPC is guaranteed to have not reached the remote 
process, or in cases in which the particular method to be retried will have no 
mutative ill-effects if retried (e.g. read operations or idempotent write 
operations.) This JIRA thus introduces a new {{RetryPolicy}} implementation - 
{{FailoverOnNetworkExceptionRetry}}. This retry policy fails over whenever a 
method call fails in such a way as to guarantee that it did not reach the 
original remote process, or when retrying a method which can be safely retried. 
In order to know which methods of an IPC interface can be safely retried, this 
JIRA introduces an {{@Idempotent}} method annotation which, if present, will be 
passed on to the retry policy by the {{RetryInvocationHandler}} when 
determining whether or not to retry a method. This addresses goal 5.
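
A rough sketch of how an IPC interface might use the annotation (the 
retention/target details and the example protocol are illustrative, not taken 
from the patch):

{code:java}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch of the proposed marker annotation. Runtime retention lets the
// RetryInvocationHandler discover it reflectively on the invoked method.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Idempotent {}

// Hypothetical IPC interface showing which operations would be annotated.
interface ExampleProtocol {
  // Reading the same value twice has no ill effects, so it is safe to retry
  // against a different process.
  @Idempotent
  long getBlockSize(String path);

  // Not annotated: replaying this after a partial failure could apply the
  // mutation twice.
  void append(String path, byte[] data);
}
{code}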

> Add client failover functionality to o.a.h.io.(ipc|retry)
> ---------------------------------------------------------
>
>                 Key: HADOOP-7380
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7380
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: ipc
>    Affects Versions: 0.23.0
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>             Fix For: 0.23.0
>
>         Attachments: hadoop-7380-hdfs-example.patch, hadoop-7380.0.patch, 
> hadoop-7380.1.patch, hadoop-7380.2.patch, hdfs-7380.3.patch
>
>
> Implementing client failover will likely require changes to {{o.a.h.io.ipc}} 
> and/or {{o.a.h.io.retry}}. This JIRA is to track those changes.
