[ 
https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886040#comment-13886040
 ] 

Jing Zhao commented on HDFS-5399:
---------------------------------

bq. If so, the fact that the former standby NN is going into safemode upon 
transition to active is the real bug here
That is not what happens. The SBN does not put itself into safemode when it 
transitions to the active state. What we saw in our test is that the SBN could 
not come out of safemode while in standby, so the safemode object was still 
non-null when the failover happened. Once the SBN becomes active it can quickly 
enter the safemode extension period, but this still adds an extra 30 seconds 
to the no-service time. 

So the question is: why can the NN quickly enter the safemode extension 
period in the active state, yet stay in safemode in the standby state? 
Our test has a lot of file creation/deletion happening. Is it possible 
that the SBN keeps tailing the edit log while holding the FSN lock, so that the 
SafeModeMonitor thread cannot get the lock to leave safemode?
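
To make the hypothesis concrete, here is a minimal, self-contained sketch of the suspected interaction. It is not the NameNode code: the class, the fields, and the timed tryLock are all hypothetical stand-ins (the timeout exists only so the sketch terminates). One thread plays the edit log tailer and holds a write lock, and the "monitor" can only leave safemode once it gets that lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the namesystem; none of these names are Hadoop APIs.
class SafeModeLockSketch {
    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
    private volatile boolean inSafeMode = true;

    // Monitor side: leaving safemode requires the write lock.
    // Returns false if the lock could not be obtained within the timeout.
    boolean tryLeaveSafeMode(long timeoutMs) throws InterruptedException {
        if (!fsnLock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            inSafeMode = false;
            return true;
        } finally {
            fsnLock.writeLock().unlock();
        }
    }

    boolean isInSafeMode() {
        return inSafeMode;
    }

    // Returns {left safemode while tailer held the lock, left safemode afterwards}.
    static boolean[] demo() {
        SafeModeLockSketch ns = new SafeModeLockSketch();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch stopTailing = new CountDownLatch(1);

        // Tailer side: holds the write lock for as long as it is "applying edits".
        Thread tailer = new Thread(() -> {
            ns.fsnLock.writeLock().lock();
            try {
                lockHeld.countDown();
                stopTailing.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                ns.fsnLock.writeLock().unlock();
            }
        });

        try {
            tailer.start();
            lockHeld.await();
            boolean whileTailing = ns.tryLeaveSafeMode(100); // lock is busy
            stopTailing.countDown();
            tailer.join();
            boolean afterTailing = ns.tryLeaveSafeMode(100); // lock is free
            return new boolean[] { whileTailing, afterTailing };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

If the real tailer behaves like the tailer thread here under a heavy create/delete load, the monitor would stay blocked in standby and only make progress once tailing stops, which would match what we saw. This is only an illustration of the question, not evidence that it is the cause.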

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry if 
> the NN is in SafeMode. Specifically, the client side's RPC adopts the 
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry 
> is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. 
> Specifically, the SafeModeException is wrapped as a RetriableException on the 
> server side. The client side's RPC uses the FailoverOnNetworkExceptionRetry policy, 
> which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by administrator 
> through CLI), and the clients may not want to retry on this type of SafeMode.
> # Clients may want to retry on other API calls in a non-HA setup.
> # We should have a single generic strategy to address the mapping between 
> SafeMode and retry policy for both HA and non-HA setup. A possible 
> straightforward solution is to always wrap the SafeModeException in the 
> RetriableException to indicate that the clients should retry.
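
The "always wrap" proposal in the description can be sketched roughly as follows. This is an illustration only: the exception classes below are local stand-ins that merely borrow the names from the description (kept unchecked to keep the sketch short), and FakeNameNode and createWithRetry are hypothetical helpers, not Hadoop APIs.

```java
// Hypothetical sketch of "always wrap SafeModeException in RetriableException".
class RetrySketch {
    // Local stand-ins, not the Hadoop classes.
    static class SafeModeException extends RuntimeException {
        SafeModeException(String msg) {
            super(msg);
        }
    }

    static class RetriableException extends RuntimeException {
        RetriableException(Throwable cause) {
            super(cause);
        }
    }

    // Pretend NameNode that stays in safemode for the first N calls.
    static class FakeNameNode {
        private int callsUntilReady;

        FakeNameNode(int callsUntilReady) {
            this.callsUntilReady = callsUntilReady;
        }

        String create(String path) {
            if (callsUntilReady > 0) {
                callsUntilReady--;
                // Server side: the safemode error is always wrapped, HA or not.
                throw new RetriableException(
                    new SafeModeException("Cannot create " + path + ": in safemode"));
            }
            return "created " + path;
        }
    }

    // Client side: a single rule, retry whenever the server says "retriable".
    static String createWithRetry(FakeNameNode nn, String path, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return nn.create(path);
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    throw e; // give up and surface the wrapped SafeModeException
                }
                // A real policy would sleep or back off here before retrying.
            }
        }
    }
}
```

With this shape the client no longer needs separate HA and non-HA handling. The sketch does not address the manual SafeMode case from the first issue above; the server would need some way to decide not to wrap in that case.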



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
