[
https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887320#comment-13887320
]
Arpit Gupta commented on HDFS-5399:
-----------------------------------
bq. Can you comment on how frequently/quickly the active NN is killed and
restarted in this test?
The tests killed the active NameNode every 5 minutes.
bq. I'm guessing you meant "NOT a flaw in the test" here? Or do I misunderstand
your point?
Yes, you are correct, I meant "not" :).
bq. I'm specifically curious about whether or not the standby NN was given
enough time to get out of startup safemode before a failover to it was
attempted.
I want to make sure I understand this scenario. As I see it, this would happen
if the current standby NameNode (nn2) had been active, was killed and restarted
recently (a few seconds ago) so that it is still in safemode, and then the
active (nn1) was killed at about the same time, sending the client to nn2 while
it is still in safemode. Did I understand it right?
I don't believe we hit this scenario, as we restarted the active NN every 5
minutes. However, I can see the need for client retries so that, even in the
above scenario, the DFSClient is able to retry and wait for the NN to come out
of safemode.
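
To illustrate the kind of client retry I have in mind, here is a minimal sketch.
It does not use the Hadoop classes: RetriableSafeModeException and callWithRetry
are hypothetical stand-ins, and the fixed sleep is an assumption rather than the
behavior of any existing retry policy.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: none of these names are Hadoop APIs.
class SafeModeRetrySketch {
    /** Stand-in for a "NameNode is in safemode, try again later" error. */
    static class RetriableSafeModeException extends Exception {
        RetriableSafeModeException(String msg) { super(msg); }
    }

    /**
     * Calls op, retrying up to maxRetries times when it reports safemode,
     * and sleeping sleepMillis between attempts. Other exceptions propagate.
     */
    static <T> T callWithRetry(Callable<T> op, int maxRetries, long sleepMillis)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return op.call();
            } catch (RetriableSafeModeException e) {
                if (retries >= maxRetries) {
                    throw e; // give up: the NN stayed in safemode too long
                }
                retries++;
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated NN that leaves safemode before the third call.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RetriableSafeModeException("NameNode is in safemode");
            }
            return "created";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With a retry budget that covers the time the NN needs to leave startup safemode,
the client call succeeds instead of failing on the first safemode response.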
> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
> Key: HDFS-5399
> URL: https://issues.apache.org/jira/browse/HDFS-5399
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.0.0
> Reporter: Jing Zhao
> Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In non-HA setup, for certain API calls ("create"), the client will retry if
> the NN is in SafeMode. Specifically, the client side's RPC adopts
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry
> is enabled.
> # In HA setup, the client will retry if the NN is Active and in SafeMode.
> Specifically, the SafeModeException is wrapped as a RetriableException in the
> server side. Client side's RPC uses FailoverOnNetworkExceptionRetry policy
> which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by administrator
> through CLI), and the clients may not want to retry on this type of SafeMode.
> # Client may want to retry on other API calls in non-HA setup.
> # We should have a single generic strategy to address the mapping between
> SafeMode and retry policy for both HA and non-HA setup. A possible
> straightforward solution is to always wrap the SafeModeException in the
> RetriableException to indicate that the clients should retry.
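
A rough sketch of the wrapping idea from the description above. All names here
are hypothetical stand-ins, not the Hadoop classes, and leaving manual safemode
unwrapped is only one assumed way to address the first issue in the list.

```java
// Hypothetical sketch: SafeModeError and RetriableError are stand-ins,
// not org.apache.hadoop classes.
class SafeModeWrapSketch {
    /** Safemode error; 'manual' marks safemode entered by an administrator. */
    static class SafeModeError extends Exception {
        final boolean manual;
        SafeModeError(String msg, boolean manual) {
            super(msg);
            this.manual = manual;
        }
    }

    /** Wrapper that signals "the client should retry this call". */
    static class RetriableError extends Exception {
        RetriableError(Throwable cause) {
            super(cause.getMessage(), cause);
        }
    }

    /** Server side: wrap automatic safemode, pass manual safemode through. */
    static Exception wrapForClient(SafeModeError e) {
        if (e.manual) {
            return e;
        }
        return new RetriableError(e);
    }

    /** Client side: one rule for HA and non-HA, retry only on the wrapper. */
    static boolean shouldRetry(Exception e) {
        return e instanceof RetriableError;
    }

    public static void main(String[] args) {
        Exception startup = wrapForClient(new SafeModeError("startup safemode", false));
        Exception manual = wrapForClient(new SafeModeError("manual safemode", true));
        System.out.println("retry on startup safemode: " + shouldRetry(startup));
        System.out.println("retry on manual safemode: " + shouldRetry(manual));
    }
}
```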