[jira] [Commented] (HDFS-5399) Revisit SafeModeException and corresponding retry policies

Jing Zhao (JIRA) Wed, 29 Jan 2014 13:33:06 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885837#comment-13885837
 ]


Jing Zhao commented on HDFS-5399:
---------------------------------

The client will not fail over. It will retry the same NN (this NN throws 
RetriableException only when it is in the active state). But I think we may 
want to add a maximum retry count there.
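
A minimal sketch of what a bounded retry against the same NN could look like. 
This is not the actual Hadoop retry policy code: the class 
BoundedSafeModeRetry, the method callWithBoundedRetry, the parameters 
maxRetries and delayMs, and the nested RetriableException (a stand-in for the 
real Hadoop exception) are all hypothetical names used for illustration.

```java
import java.util.concurrent.Callable;

public class BoundedSafeModeRetry {

    /** Hypothetical stand-in for Hadoop's RetriableException. */
    public static class RetriableException extends Exception {
        public RetriableException(String msg) {
            super(msg);
        }
    }

    /**
     * Invokes the call against the same target (no failover). After the
     * first attempt, retries at most maxRetries times, sleeping delayMs
     * between attempts. Rethrows the last RetriableException once the
     * retry budget is used up.
     */
    public static <T> T callWithBoundedRetry(Callable<T> call,
                                             int maxRetries,
                                             long delayMs) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return call.call();
            } catch (RetriableException e) {
                if (retries >= maxRetries) {
                    throw e; // budget exhausted: surface the error
                }
                retries++;
                Thread.sleep(delayMs);
            }
        }
    }
}
```

With a fixed delay, the total wait is bounded by roughly maxRetries * delayMs, 
so the bound would have to be chosen to cover the time the NN needs to leave 
safemode.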

bq. Formerly, we did not retry safe mode exceptions, whether or not we were in 
HA mode.
The issue with the HA setup is that the SBN may stay in safemode for a long 
time, and when it transitions to the active state it needs more than 30s to 
come out of safemode. This makes the actual failover time long, since under 
the old behavior the client retries only once. This can then cause an HBase 
region server to time out and kill itself. Thus we need to let the client wait 
and retry for a longer time.

But in the meantime, I think we should revisit the safemode extension and see 
if we can keep the NN from entering safemode unnecessarily and shorten the 
safemode period.

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry 
> if the NN is in SafeMode. Specifically, the client side's RPC adopts the 
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry 
> is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. 
> Specifically, the SafeModeException is wrapped as a RetriableException on the 
> server side. The client side's RPC uses the FailoverOnNetworkExceptionRetry 
> policy, which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by administrator 
> through CLI), and the clients may not want to retry on this type of SafeMode.
> # Client may want to retry on other API calls in non-HA setup.
> # We should have a single generic strategy to address the mapping between 
> SafeMode and retry policy for both HA and non-HA setup. A possible 
> straightforward solution is to always wrap the SafeModeException in the 
> RetriableException to indicate that the clients should retry.
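
The wrapping idea in the last item can be sketched as below. This is only an 
illustration under stated assumptions, not the NameNode's code: the class 
SafeModeWrapSketch, the method toClientException, and the manualSafeMode flag 
are hypothetical, and both exception classes are local stand-ins for the real 
Hadoop ones. Skipping the wrap for a manual SafeMode is one possible way to 
address the first issue in the list, not something the description prescribes.

```java
import java.io.IOException;

public class SafeModeWrapSketch {

    /** Hypothetical stand-in for the NN's SafeModeException. */
    public static class SafeModeException extends IOException {
        public SafeModeException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical stand-in for Hadoop's RetriableException. */
    public static class RetriableException extends IOException {
        public RetriableException(Throwable cause) {
            super(cause);
        }
    }

    /**
     * Chooses the exception the server hands back to the client.
     * Assumption: a SafeMode entered manually by an administrator is not
     * wrapped, so clients do not keep retrying against it; any other
     * SafeModeException is wrapped to signal that a retry is worthwhile.
     */
    public static IOException toClientException(SafeModeException e,
                                                boolean manualSafeMode) {
        if (manualSafeMode) {
            return e; // clients see the raw SafeModeException
        }
        return new RetriableException(e); // clients should retry
    }
}
```

The same rule would apply to both HA and non-HA setups, which is the point of 
having a single generic strategy.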



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
