[ 
https://issues.apache.org/jira/browse/HDDS-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDDS-4430:
-------------------------------------

    Assignee: Hanisha Koneru

> OM failover timeout is too short
> --------------------------------
>
>                 Key: HDDS-4430
>                 URL: https://issues.apache.org/jira/browse/HDDS-4430
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: OM HA
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Hanisha Koneru
>            Priority: Critical
>
> The OM currently has a one-second failover timeout. This is too short, as any 
> network hiccup, system I/O stall, or JVM GC pause could easily trigger a failover.
> Example:
> {noformat}
> 2020-10-29 09:02:46,557 WARN org.apache.ratis.server.impl.RaftServerImpl: 
> om3@group-942F8267F22A-LeaderState: Lost leadership on term: 33. Election 
> timeout: 1200ms. In charge for: 826650319ms. Conf: 32189729: 
> [om1:rhelnn01.ozone.cisco.local:9872:0, 
> om3:rhelnn03.ozone.cisco.local:9872:0, 
> om2:rhelnn02.ozone.cisco.local:9872:0], old=null. Followers: 
> [om3@group-942F8267F22A->om1(c34577386,m34577394,n34577395, attendVote=true, 
> lastRpcSendTime=7, lastRpcResponseTime=0), 
> om3@group-942F8267F22A->om2(c34577386,m34577261,n34577395, attendVote=true, 
> lastRpcSendTime=7, lastRpcResponseTime=0)]
> 2020-10-29 09:02:46,558 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected 
> pause in JVM or host machine (eg GC): pause of approximately 2236ms
> No GCs detected
> 2020-10-29 09:02:46,562 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> om3@group-942F8267F22A: changes role from    LEADER to FOLLOWER at term 33 
> for stepDown
> 2020-10-29 09:02:46,563 INFO org.apache.ratis.server.impl.RoleInfo: om3: 
> shutdown LeaderState
> {noformat}
> [~hanishakoneru] thinks we should increase the Ratis leader election timeout 
> as well.
> {noformat}
>   <property>
>     <name>ozone.om.ratis.minimum.timeout</name>
>     <value>1s</value>
>     <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
>     <description>The minimum timeout duration for OM's Ratis server rpc.
>     </description>
>   </property>
>   <property>
>     <name>ozone.om.leader.election.minimum.timeout.duration</name>
>     <value>1s</value>
>     <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
>     <description>The minimum timeout duration for OM ratis leader election.
>       Default is 1s.
>     </description>
>   </property>
> {noformat}
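> As a rough sketch of the proposed change, both timeouts could be raised with 
> an override along these lines, for example in ozone-site.xml. The 5s values 
> are illustrative assumptions, not tested recommendations; they are chosen only 
> to exceed the roughly 2236ms pause in the log above, which was already longer 
> than the 1200ms election timeout.
> {noformat}
>   <!-- Illustrative values only (assumption): pick timeouts longer than the
>        pauses actually observed on the cluster. -->
>   <property>
>     <name>ozone.om.ratis.minimum.timeout</name>
>     <value>5s</value>
>   </property>
>   <property>
>     <name>ozone.om.leader.election.minimum.timeout.duration</name>
>     <value>5s</value>
>   </property>
> {noformat}
> A longer timeout lets the leader ride out short pauses, at the cost of slower 
> detection when a leader has genuinely failed.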



--
This message was sent by Atlassian Jira
(v8.3.4#803005)