[
https://issues.apache.org/jira/browse/HDDS-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang reassigned HDDS-4430:
-------------------------------------
Assignee: Hanisha Koneru
> OM failover timeout is too short
> --------------------------------
>
> Key: HDDS-4430
> URL: https://issues.apache.org/jira/browse/HDDS-4430
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: OM HA
> Affects Versions: 1.0.0, 1.1.0
> Reporter: Wei-Chiu Chuang
> Assignee: Hanisha Koneru
> Priority: Critical
>
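> The OM currently has a one-second failover timeout. This is too short: any
> network hiccup, system I/O stall, or JVM GC pause could easily trigger a
> failover.
> Example (a pause of approximately 2236ms exceeds the 1200ms election timeout,
> and om3 loses leadership):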
> {noformat}
> 2020-10-29 09:02:46,557 WARN org.apache.ratis.server.impl.RaftServerImpl:
> om3@group-942F8267F22A-LeaderState: Lost leadership on term: 33. Election
> timeout: 1200ms. In charge for: 826650319ms. Conf: 32189729:
> [om1:rhelnn01.ozone.cisco.local:9872:0,
> om3:rhelnn03.ozone.cisco.local:9872:0,
> om2:rhelnn02.ozone.cisco.local:9872:0], old=null. Followers:
> [om3@group-942F8267F22A->om1(c34577386,m34577394,n34577395, attendVote=true,
> lastRpcSendTime=7, lastRpcResponseTime=0),
> om3@group-942F8267F22A->om2(c34577386,m34577261,n34577395, attendVote=true,
> lastRpcSendTime=7, lastRpcResponseTime=0)]
> 2020-10-29 09:02:46,558 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected
> pause in JVM or host machine (eg GC): pause of approximately 2236ms
> No GCs detected
> 2020-10-29 09:02:46,562 INFO org.apache.ratis.server.impl.RaftServerImpl:
> om3@group-942F8267F22A: changes role from LEADER to FOLLOWER at term 33
> for stepDown
> 2020-10-29 09:02:46,563 INFO org.apache.ratis.server.impl.RoleInfo: om3:
> shutdown LeaderState
> {noformat}
> [~hanishakoneru] also thinks we should increase the Ratis leader election
> timeout as well. The relevant properties and their current values:
> {noformat}
> <property>
> <name>ozone.om.ratis.minimum.timeout</name>
> <value>1s</value>
> <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
> <description>The minimum timeout duration for OM's Ratis server rpc.
> </description>
> </property>
> <property>
> <name>ozone.om.leader.election.minimum.timeout.duration</name>
> <value>1s</value>
> <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
> <description>The minimum timeout duration for OM ratis leader election.
> Default is 1s.
> </description>
> </property>
> {noformat}
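> As a rough sketch of the proposed change (not a tested recommendation), both
> timeouts could be raised with an override in ozone-site.xml. The 5s value is
> an assumption, picked only because it exceeds the 2236ms pause in the log
> above; the right value would need to be validated on a real cluster.
> {noformat}
> <!-- Assumed value: 5s is illustrative only, not a verified default. -->
> <property>
> <name>ozone.om.ratis.minimum.timeout</name>
> <value>5s</value>
> </property>
> <property>
> <name>ozone.om.leader.election.minimum.timeout.duration</name>
> <value>5s</value>
> </property>
> {noformat}
> Note that a longer timeout also delays detection of a leader that has really
> failed, so the value trades failover speed against tolerance of short pauses.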