Wei-Chiu Chuang created HDDS-4430:
-------------------------------------

             Summary: OM failover timeout is too short
                 Key: HDDS-4430
                 URL: https://issues.apache.org/jira/browse/HDDS-4430
             Project: Hadoop Distributed Data Store
          Issue Type: Improvement
          Components: OM HA
    Affects Versions: 1.0.0, 1.1.0
            Reporter: Wei-Chiu Chuang


The OM currently has a one-second failover timeout. This is too short, as any 
network hiccup, system I/O stall, or JVM GC pause could easily trigger a failover.

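Example (a pause of about 2.2s in the JVM or host machine exceeds the 1200ms 
election timeout, and om3 steps down as leader):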
{noformat}
2020-10-29 09:02:46,557 WARN org.apache.ratis.server.impl.RaftServerImpl: om3@group-942F8267F22A-LeaderState: Lost leadership on term: 33. Election timeout: 1200ms. In charge for: 826650319ms. Conf: 32189729: [om1:rhelnn01.ozone.cisco.local:9872:0, om3:rhelnn03.ozone.cisco.local:9872:0, om2:rhelnn02.ozone.cisco.local:9872:0], old=null. Followers: [om3@group-942F8267F22A->om1(c34577386,m34577394,n34577395, attendVote=true, lastRpcSendTime=7, lastRpcResponseTime=0), om3@group-942F8267F22A->om2(c34577386,m34577261,n34577395, attendVote=true, lastRpcSendTime=7, lastRpcResponseTime=0)]
2020-10-29 09:02:46,558 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2236ms
No GCs detected
2020-10-29 09:02:46,562 INFO org.apache.ratis.server.impl.RaftServerImpl: om3@group-942F8267F22A: changes role from    LEADER to FOLLOWER at term 33 for stepDown
2020-10-29 09:02:46,563 INFO org.apache.ratis.server.impl.RoleInfo: om3: shutdown LeaderState
{noformat}

[~hanishakoneru] also thinks we should increase the Ratis leader election 
timeout. The current defaults are:

{noformat}
  <property>
    <name>ozone.om.ratis.minimum.timeout</name>
    <value>1s</value>
    <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
    <description>The minimum timeout duration for OM's Ratis server rpc.
    </description>
  </property>

  <property>
    <name>ozone.om.leader.election.minimum.timeout.duration</name>
    <value>1s</value>
    <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
    <description>The minimum timeout duration for OM ratis leader election.
      Default is 1s.
    </description>
  </property>
{noformat}
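
Until the defaults change, a possible workaround is to override both 
properties in ozone-site.xml. The fragment below is only a sketch: the 
property names are the ones quoted above, but the 5s value is an illustrative 
placeholder, not a tested recommendation, and I am assuming both properties 
can be overridden this way.

{noformat}
<configuration>
  <!-- Assumed override; 5s is a placeholder value, not a recommendation. -->
  <property>
    <name>ozone.om.ratis.minimum.timeout</name>
    <value>5s</value>
  </property>

  <!-- Assumed override; keep it consistent with the RPC timeout above. -->
  <property>
    <name>ozone.om.leader.election.minimum.timeout.duration</name>
    <value>5s</value>
  </property>
</configuration>
{noformat}

Whatever value is chosen should comfortably exceed the pauses actually seen 
in the logs (about 2.2s in the example above).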



