[ https://issues.apache.org/jira/browse/HDFS-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901534#comment-14901534 ]

Zhe Zhang commented on HDFS-9119:
---------------------------------

We have a few options to fix the discrepancy:
# Shorten the edit log tailing interval from 2 minutes to 1 minute.
# Change the timeout of {{transitionToActive}} to 2 minutes. This would require adding logic to support per-RPC timeout configuration.
# A more complex solution would be to add a {{prepareTransitionToActive}} RPC call.

I'm leaning toward solution #1 because it's the simplest, and more frequent
edit log tailing (and, consequently, more edit log segments) should be
acceptable behavior. Please let me know if you have any concerns about this
approach.
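The trade-off between options #1 and #2 can be sketched with a simple backlog model. This is a hypothetical back-of-the-envelope calculation, not Hadoop code: it assumes the failover request arrives just before the next scheduled tail (the worst case), that edits accumulate linearly at a constant rate on the active, that the standby applies them at a constant rate, and it ignores any edits written while the catch-up is in progress. The class name, method names, and the edit/apply rates are all illustrative assumptions.

```java
// Hypothetical back-of-the-envelope model of the HDFS-9119 discrepancy.
// Not Hadoop code; all names and rates are illustrative assumptions.
public class TailingBacklogModel {

    /**
     * Worst-case seconds the standby needs to apply the edits that
     * accumulated since its last tail, assuming the failover request
     * arrives just before the next scheduled tail.
     */
    static double worstCaseCatchUpSeconds(double tailIntervalSec,
                                          double editsPerSecOnActive,
                                          double editsPerSecAppliedOnStandby) {
        // Backlog grows linearly over one full tailing interval.
        double backlogEdits = tailIntervalSec * editsPerSecOnActive;
        return backlogEdits / editsPerSecAppliedOnStandby;
    }

    /** True if the worst-case catch-up finishes within the RPC timeout. */
    static boolean fitsInTimeout(double tailIntervalSec, double rpcTimeoutSec,
                                 double editRate, double applyRate) {
        return worstCaseCatchUpSeconds(tailIntervalSec, editRate, applyRate)
                <= rpcTimeoutSec;
    }

    public static void main(String[] args) {
        // Assumed rates: under heavy load the standby applies edits only
        // 1.5x as fast as the active produces them.
        double editRate = 10_000;
        double applyRate = 15_000;

        System.out.println("current  (120s tail, 60s timeout):  "
                + fitsInTimeout(120, 60, editRate, applyRate));
        System.out.println("option 1 (60s tail, 60s timeout):   "
                + fitsInTimeout(60, 60, editRate, applyRate));
        System.out.println("option 2 (120s tail, 120s timeout): "
                + fitsInTimeout(120, 120, editRate, applyRate));
    }
}
```

With these assumed rates the current setup needs 80 seconds against a 60-second timeout and fails, while both options succeed (40 s of 60 s, and 80 s of 120 s). In this model the condition is interval × (edit rate / apply rate) ≤ timeout, so either option works by making the tailing interval no longer than the timeout, provided the standby applies edits at least as fast as the active writes them.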

> Discrepancy between edit log tailing interval and RPC timeout for 
> transitionToActive
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-9119
>                 URL: https://issues.apache.org/jira/browse/HDFS-9119
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.7.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>
> {{EditLogTailer}} on the standby NameNode tails edits from the active NameNode
> every 2 minutes, but the {{transitionToActive}} RPC call has a timeout of 1
> minute. If the active NameNode experiences a very intensive metadata workload
> (in particular, many {{AddOp}} and {{MkDir}} operations creating new files and
> directories), the standby NameNode may be unable to apply all the updates
> accumulated during the 2-minute tailing interval within the 1-minute timeout
> window. If that happens, the FailoverController will time out and give up
> trying to transition the standby to active, and the old ANN will resume adding
> edits. When the SbNN finally finishes catching up on the edits and tries to
> become active, it will crash.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
