[ 
https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885930#comment-13885930
 ] 

Aaron T. Myers commented on HDFS-5399:
--------------------------------------

On that JIRA I asked the following question:

bq. Is my understanding of this issue correct that the only thing we're trying 
to fix here is the fact the clients are not retrying attempting to talk to the 
active NN when it receives a safemode exception? i.e. it's not the case that 
the standby NN is somehow incorrectly going into safemode after a failover?

Based on Jing's response, I concluded that my understanding of the issue was 
correct, but it now seems that it was not. If so, the real bug here is that the 
former standby NN goes into safemode upon transition to active, not that 
clients don't retry when the NN is in safemode. That is what we should be 
fixing, not the client RPC retry behavior.

Jing/Arpit - does either of you have any insight into why you observed the NN 
going into safemode upon transition to active? If we can figure that out, then 
we should fix it, and perhaps revert or modify the new behavior introduced in 
HDFS-5291.

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry 
> if the NN is in SafeMode. Specifically, the client-side RPC adopts the 
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry 
> is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. 
> Specifically, the SafeModeException is wrapped as a RetriableException on the 
> server side. The client-side RPC uses the FailoverOnNetworkExceptionRetry 
> policy, which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by an 
> administrator through the CLI), and clients may not want to retry on this 
> type of SafeMode.
> # Clients may want to retry on other API calls in a non-HA setup.
> # We should have a single generic strategy to address the mapping between 
> SafeMode and retry policy for both HA and non-HA setups. A possible 
> straightforward solution is to always wrap the SafeModeException in a 
> RetriableException to indicate that the clients should retry.
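
To make the proposed strategy concrete, here is a minimal sketch of the idea, 
not the actual Hadoop implementation. The classes below are simplified, 
hypothetical stand-ins for Hadoop's SafeModeException and RetriableException; 
the "manual" flag, wrapForClient() and callWithRetry() are assumptions made 
purely for illustration. The sketch assumes the server wraps only automatic 
safemode, so that clients retry on it but fail fast on a manual safemode.

```java
import java.util.concurrent.Callable;

public class SafeModeRetrySketch {

    // Hypothetical stand-in for the NN's safemode exception.
    // The "manual" flag is an assumption, not a field of the real class.
    static class SafeModeException extends Exception {
        final boolean manual;

        SafeModeException(String message, boolean manual) {
            super(message);
            this.manual = manual;
        }
    }

    // Hypothetical stand-in for an exception that tells clients to retry.
    static class RetriableException extends Exception {
        RetriableException(Throwable cause) {
            super(cause);
        }
    }

    // Server side (assumed): wrap automatic safemode so clients retry,
    // but pass manual safemode through unwrapped so clients fail fast.
    static Exception wrapForClient(SafeModeException e) {
        return e.manual ? e : new RetriableException(e);
    }

    // Client side (assumed): retry only on RetriableException, up to
    // maxRetries retries with a fixed sleep between attempts. Any other
    // exception propagates immediately.
    static <T> T callWithRetry(Callable<T> call, int maxRetries,
                               long sleepMillis) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return call.call();
            } catch (RetriableException e) {
                if (retries++ >= maxRetries) {
                    throw e; // retry budget exhausted
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

With a single rule like this, the HA and non-HA client paths would only need 
to recognize one exception type. Whether manual safemode should be excluded 
from wrapping is exactly the open question raised in the first item above.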



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
