[ https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887302#comment-13887302 ]

Aaron T. Myers commented on HDFS-5399:
--------------------------------------

bq. It's not the test that directly fails. We see exceptions in the RM when it's 
trying to talk to HDFS, or in the RS when it's trying to talk to HDFS, which cause 
the actual MR job etc. to fail. So it's not something that the test can control. 
For example, we are running an MR job and periodically killing the active NN, 
and the job eventually fails because the tasks that want to talk to HDFS fail or the 
RM runs into this exception, causing the application to fail.

I get that, but I'm specifically curious about whether or not the standby NN 
was given enough time to get out of startup safemode before a failover to it 
was attempted. Can you comment on how frequently/quickly the active NN is 
killed and restarted in this test?

bq. Hence I would argue that it's a flaw in the test.

I'm guessing you meant "NOT a flaw in the test" here? Or do I misunderstand 
your point?

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently, for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry if 
> the NN is in SafeMode. Specifically, the client side's RPC adopts the 
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry 
> is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. 
> Specifically, the SafeModeException is wrapped as a RetriableException on the 
> server side. The client side's RPC uses the FailoverOnNetworkExceptionRetry 
> policy, which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by an administrator 
> through the CLI), and clients may not want to retry on this type of SafeMode.
> # Clients may want to retry on other API calls in a non-HA setup.
> # We should have a single generic strategy to address the mapping between 
> SafeMode and retry policy for both HA and non-HA setups. A possible 
> straightforward solution is to always wrap the SafeModeException in a 
> RetriableException to indicate that the clients should retry.
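
For illustration only, the wrapping idea in the description above could be sketched roughly as follows. This is a minimal, self-contained sketch and not Hadoop's actual code: the two exception classes are hypothetical stand-ins for the real SafeModeException and RetriableException, and the `manual` flag, `wrapForClient`, and `callWithRetry` are names assumed here purely to show the shape of the strategy. Skipping the wrap for a manual SafeMode is one possible reading of the first issue listed above, not something the description settles.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class SafeModeRetrySketch {

    // Hypothetical stand-in for the NN's SafeModeException.
    // The 'manual' flag is an assumption used to illustrate the "Manual" SafeMode concern.
    static class SafeModeException extends IOException {
        final boolean manual;

        SafeModeException(String message, boolean manual) {
            super(message);
            this.manual = manual;
        }
    }

    // Hypothetical stand-in for RetriableException: a marker telling the client to retry.
    static class RetriableException extends IOException {
        RetriableException(IOException cause) {
            super(cause.getMessage(), cause);
        }
    }

    // Server side (assumed helper): wrap automatic SafeMode so clients retry,
    // but leave an administrator-initiated SafeMode unwrapped.
    static IOException wrapForClient(SafeModeException e) {
        return e.manual ? e : new RetriableException(e);
    }

    // Client side (assumed helper): a single policy for HA and non-HA setups that
    // retries only on RetriableException, with a fixed sleep between attempts.
    static <T> T callWithRetry(Callable<T> call, int maxRetries, long sleepMillis)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted; surface the failure to the caller
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

With this shape, a call that hits an automatic SafeMode is retried until the NN leaves SafeMode or the retry budget runs out, while a manual SafeMode surfaces to the caller immediately as a plain SafeModeException.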



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
