[ 
https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891066#comment-13891066
 ] 

Todd Lipcon commented on HDFS-5399:
-----------------------------------

Catching up on this issue now, so apologies if I've missed some context in my 
reading of the discussion.

- One of the issues is that the SBN may be in Safe Mode while it's tailing. 
When it becomes active, it has the latest edits and can come out of safemode, 
but still goes through the extension. The original reason for the extension was 
to prevent a replication storm in the case that the NN has only one replica of all 
the blocks, but several DNs haven't yet reported. In the SBN case, since we've 
already been running in standby mode for a while, it seems unlikely that the 
extension is necessary. Maybe we should consider changing the extension so 
that, if we don't have a significant number of under-replicated blocks, we 
don't go through the extension?
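
To make the suggestion above concrete, here is a minimal sketch of the check. It is not the actual NameNode code: the class name, the method, and the idea of expressing "a significant number" as a fraction of total blocks are all assumptions made for illustration.

```java
// Hypothetical sketch only; none of these names exist in HDFS.
final class SafeModeExtensionSketch {
    private SafeModeExtensionSketch() {}

    /**
     * Returns the extension (in ms) to apply when leaving safe mode.
     * The extension is skipped when the share of under-replicated blocks
     * is at or below maxUnderReplicatedFraction (an assumed threshold).
     */
    static long effectiveExtensionMs(long underReplicatedBlocks,
                                     long totalBlocks,
                                     double maxUnderReplicatedFraction,
                                     long configuredExtensionMs) {
        if (totalBlocks <= 0) {
            // No blocks at all, so there is nothing that could be re-replicated.
            return 0L;
        }
        double fraction = (double) underReplicatedBlocks / totalBlocks;
        // Few under-replicated blocks: little risk of a replication storm.
        return fraction <= maxUnderReplicatedFraction ? 0L : configuredExtensionMs;
    }
}
```

With an assumed threshold of 0.1%, an SBN that sees 10 under-replicated blocks out of 1,000,000 would skip the extension, while one that sees 500,000 would still wait the full configured time.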

- Regardless, we should limit the number of retries as Jing proposed above. 
Retrying indefinitely should never be our default. How about we introduce a 
configuration here and default to ~30sec of retries? Those who want to retry 
forever could reconfigure to a longer time period.
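
A rough sketch of what a time-bounded retry decision could look like follows. The class, its fields, and the 30-second default constant are hypothetical and only illustrate the proposal; this is not an existing Hadoop retry policy or configuration key.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; not an existing Hadoop class.
final class BoundedSafeModeRetry {
    /** Assumed default budget: roughly 30 seconds of retries. */
    static final long DEFAULT_MAX_RETRY_MS = TimeUnit.SECONDS.toMillis(30);

    private final long maxRetryMs; // total time budget for retries
    private final long sleepMs;    // fixed sleep between attempts

    BoundedSafeModeRetry(long maxRetryMs, long sleepMs) {
        if (maxRetryMs < 0 || sleepMs <= 0) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        this.maxRetryMs = maxRetryMs;
        this.sleepMs = sleepMs;
    }

    /**
     * Returns true if one more sleep-and-retry still fits in the budget.
     * elapsedMs is the non-negative time already spent retrying.
     */
    boolean shouldRetry(long elapsedMs) {
        // Written as a subtraction so a very large budget cannot overflow.
        return maxRetryMs - elapsedMs >= sleepMs;
    }
}
```

Someone who wants to retry effectively forever could pass a very large budget such as Long.MAX_VALUE, which matches the "reconfigure to a longer time period" option.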

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.3.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In non-HA setup, for certain API calls ("create"), the client will retry if 
> the NN is in SafeMode. Specifically, the client side's RPC adopts 
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry 
> is enabled.
> # In HA setup, the client will retry if the NN is Active and in SafeMode. 
> Specifically, the SafeModeException is wrapped as a RetriableException in the 
> server side. Client side's RPC uses FailoverOnNetworkExceptionRetry policy 
> which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by administrator 
> through CLI), and the clients may not want to retry on this type of SafeMode.
> # Client may want to retry on other API calls in non-HA setup.
> # We should have a single generic strategy to address the mapping between 
> SafeMode and retry policy for both HA and non-HA setup. A possible 
> straightforward solution is to always wrap the SafeModeException in the 
> RetriableException to indicate that the clients should retry.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)