[jira] [Commented] (HDFS-5399) Revisit SafeModeException and corresponding retry policies

Aaron T. Myers (JIRA) Wed, 29 Jan 2014 18:42:31 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886202#comment-13886202
 ]


Aaron T. Myers commented on HDFS-5399:
--------------------------------------

bq. What if a lot of file creation/deletion requests keep coming? If the 
editlog keeps growing, is it possible that the SBN keeps tailing the editlog in 
a single session and cannot get a chance to go back to sleep?

I don't think so. The EditLogTailer sleeps for the same amount of time between 
edit log reads regardless of how many edits it just read, so it will definitely 
release the FSN lock.
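
A minimal sketch of that argument, in Java: a loop that holds a lock only while 
applying a batch of edits and then sleeps for a fixed interval, however large 
the batch was. All class, field, and method names here are hypothetical and 
this is not the actual EditLogTailer code; it only illustrates why a fixed 
sleep taken outside the lock leaves room for other lock users.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; not the real EditLogTailer.
class TailerSketch {
    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
    private final long sleepMillis;
    private long editsApplied = 0;
    private boolean lockHeldWhileSleeping = false;

    TailerSketch(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    // Apply one batch of pending edits while holding the write lock.
    private void tailOnce(int pendingEdits) {
        fsnLock.writeLock().lock();
        try {
            editsApplied += pendingEdits;
        } finally {
            fsnLock.writeLock().unlock();
        }
    }

    // Run one tailing round per array entry; each entry is that round's batch size.
    void run(int[] pendingPerRound) {
        for (int pending : pendingPerRound) {
            tailOnce(pending);
            // Record whether the lock leaked past tailOnce (it should not).
            lockHeldWhileSleeping |= fsnLock.isWriteLockedByCurrentThread();
            try {
                Thread.sleep(sleepMillis); // same sleep regardless of batch size
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    long getEditsApplied() { return editsApplied; }

    boolean wasLockHeldWhileSleeping() { return lockHeldWhileSleeping; }

    public static void main(String[] args) {
        TailerSketch tailer = new TailerSketch(10);
        tailer.run(new int[] {5, 1000, 0});
        System.out.println("edits applied: " + tailer.getEditsApplied());
        System.out.println("lock held while sleeping: " + tailer.wasLockHeldWhileSleeping());
    }
}
```

In this sketch a round with 1000 pending edits and a round with none both end 
with the lock released and the same sleep.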

I think the most likely possibilities are:

# For some reason we're not doing the "should we leave safemode" check when in 
the standby state.
# The test you observed this issue in didn't run long enough for the standby NN 
to leave startup safemode on its own before the failover was attempted. The NN 
delays processing block reports for block IDs it doesn't recognize (because 
they were created in edits that the NN hasn't read yet). Only on transition to 
active does it fully catch up by reading all the edits and then re-process the 
delayed block reports, which triggers the NN to leave startup safemode.

If it's something like the first one, then that seems like a legit bug. If it's 
more like the latter, then that seems more like a flaw in the test you were 
running.
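
The second possibility above can be sketched as a small toy model in Java. It 
only shows the queueing mechanics described in that item: reports for unknown 
block IDs are parked, and they are replayed once the remaining edits have been 
read. Every name below is hypothetical, and the real NameNode's block report 
and safemode handling is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of the delayed block report scenario; all names are hypothetical.
class DelayedReportSketch {
    private final Set<Long> knownBlocks = new HashSet<>();
    private final Set<Long> reportedBlocks = new HashSet<>();
    private final Queue<Long> delayedReports = new ArrayDeque<>();

    // The standby learns about a block by reading the edit that created it.
    void applyEdit(long blockId) {
        knownBlocks.add(blockId);
    }

    // A DataNode reports a replica; reports for unknown block IDs are parked.
    void receiveReport(long blockId) {
        if (knownBlocks.contains(blockId)) {
            reportedBlocks.add(blockId);
        } else {
            delayedReports.add(blockId);
        }
    }

    // On transition to active: read the remaining edits, then replay the queue.
    void catchUpAndReplay(long... remainingEdits) {
        for (long blockId : remainingEdits) {
            applyEdit(blockId);
        }
        int queued = delayedReports.size();
        for (int i = 0; i < queued; i++) {
            // Reports that are still unknown are simply queued again.
            receiveReport(delayedReports.poll());
        }
    }

    int reportedCount() { return reportedBlocks.size(); }

    int delayedCount() { return delayedReports.size(); }

    public static void main(String[] args) {
        DelayedReportSketch standby = new DelayedReportSketch();
        standby.applyEdit(1L);            // only the edit for block 1 has been read
        standby.receiveReport(1L);
        standby.receiveReport(2L);        // unknown so far: delayed
        standby.receiveReport(3L);        // unknown so far: delayed
        System.out.println(standby.reportedCount() + " reported, "
                + standby.delayedCount() + " delayed");
        standby.catchUpAndReplay(2L, 3L); // failover: read the rest, then replay
        System.out.println(standby.reportedCount() + " reported, "
                + standby.delayedCount() + " delayed");
    }
}
```

Until the catch-up step runs, the model counts only one of the three blocks as 
reported, which is the kind of gap that could keep a short-lived test's standby 
in startup safemode.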

Jing - thanks for trying to repro the test. I'm looking forward to hearing your 
findings.

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In non-HA setup, for certain API call ("create"), the client will retry if 
> the NN is in SafeMode. Specifically, the client side's RPC adopts 
> MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry 
> is enabled.
> # In HA setup, the client will retry if the NN is Active and in SafeMode. 
> Specifically, the SafeModeException is wrapped as a RetriableException in the 
> server side. Client side's RPC uses FailoverOnNetworkExceptionRetry policy 
> which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by administrator 
> through CLI), and the clients may not want to retry on this type of SafeMode.
> # Client may want to retry on other API calls in non-HA setup.
> # We should have a single generic strategy to address the mapping between 
> SafeMode and retry policy for both HA and non-HA setup. A possible 
> straightforward solution is to always wrap the SafeModeException in the 
> RetriableException to indicate that the clients should retry.
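
As a rough illustration of the proposal quoted above, the following Java sketch 
wraps a safemode error in a retriable marker on the server side and retries 
only on that marker on the client side. The classes are local stand-ins, not 
Hadoop's SafeModeException or RetriableException, and the "manual" flag and all 
method names are assumptions made for the example. Skipping the wrap for manual 
safemode is one possible way to address the first issue in the list, not 
something the description prescribes.

```java
import java.io.IOException;

// Stand-in types and names are hypothetical; this is not Hadoop's retry code.
class SafeModeRetrySketch {
    // Stand-in for a safemode error; the `manual` flag is an assumption.
    static class SafeModeStub extends IOException {
        final boolean manual;
        SafeModeStub(String message, boolean manual) {
            super(message);
            this.manual = manual;
        }
    }

    // Stand-in for a marker exception telling the client that a retry may help.
    static class RetriableStub extends IOException {
        RetriableStub(Throwable cause) {
            super(cause);
        }
    }

    // Toy server that rejects its first `safeModeCalls` requests.
    static class FakeNameNode {
        private int safeModeCalls;
        private final boolean manual;

        FakeNameNode(int safeModeCalls, boolean manual) {
            this.safeModeCalls = safeModeCalls;
            this.manual = manual;
        }

        void create(String path) throws IOException {
            if (safeModeCalls > 0) {
                safeModeCalls--;
                SafeModeStub e = new SafeModeStub("Cannot create " + path, manual);
                if (manual) {
                    throw e;                 // manual safemode: do not invite retries
                }
                throw new RetriableStub(e);  // automatic safemode: wrap as retriable
            }
        }
    }

    // Returns the attempt number that succeeded, or -1 if the client gave up.
    static int createWithRetry(FakeNameNode nn, String path, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                nn.create(path);
                return attempt;
            } catch (RetriableStub e) {
                // Retry; a real client would back off before the next attempt.
            } catch (IOException e) {
                return -1;                   // not marked retriable: fail at once
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(createWithRetry(new FakeNameNode(2, false), "/f", 5));
        System.out.println(createWithRetry(new FakeNameNode(2, true), "/f", 5));
    }
}
```

With this shape the same client loop could serve both HA and non-HA setups, 
since the retry decision depends only on the marker the server attaches.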



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
