Wei-Chiu Chuang created HDDS-4430:
-------------------------------------
Summary: OM failover timeout is too short
Key: HDDS-4430
URL: https://issues.apache.org/jira/browse/HDDS-4430
Project: Hadoop Distributed Data Store
Issue Type: Improvement
Components: OM HA
Affects Versions: 1.0.0, 1.1.0
Reporter: Wei-Chiu Chuang
The OM currently has a one-second failover timeout. This is too short: any
network hiccup, system I/O stall, or JVM GC pause can easily trigger a failover.
Example:
{noformat}
2020-10-29 09:02:46,557 WARN org.apache.ratis.server.impl.RaftServerImpl: om3@group-942F8267F22A-LeaderState: Lost leadership on term: 33. Election timeout: 1200ms. In charge for: 826650319ms. Conf: 32189729: [om1:rhelnn01.ozone.cisco.local:9872:0, om3:rhelnn03.ozone.cisco.local:9872:0, om2:rhelnn02.ozone.cisco.local:9872:0], old=null. Followers: [om3@group-942F8267F22A->om1(c34577386,m34577394,n34577395, attendVote=true, lastRpcSendTime=7, lastRpcResponseTime=0), om3@group-942F8267F22A->om2(c34577386,m34577261,n34577395, attendVote=true, lastRpcSendTime=7, lastRpcResponseTime=0)]
2020-10-29 09:02:46,558 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2236ms
No GCs detected
2020-10-29 09:02:46,562 INFO org.apache.ratis.server.impl.RaftServerImpl: om3@group-942F8267F22A: changes role from LEADER to FOLLOWER at term 33 for stepDown
2020-10-29 09:02:46,563 INFO org.apache.ratis.server.impl.RoleInfo: om3: shutdown LeaderState
{noformat}
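The arithmetic behind the log excerpt: the reported pause (about 2236ms) is longer than the election timeout (1200ms), which is consistent with om3 stepping down. A minimal Python sketch of that comparison follows. It assumes only the two message formats shown in the excerpt; the function name pauses_exceeding_timeout is hypothetical and is not part of Ozone or Ratis.

```python
import re

# Excerpt of the two relevant messages from the log in this report
# (the first line is truncated after the election timeout).
LOG = """\
2020-10-29 09:02:46,557 WARN org.apache.ratis.server.impl.RaftServerImpl: om3@group-942F8267F22A-LeaderState: Lost leadership on term: 33. Election timeout: 1200ms.
2020-10-29 09:02:46,558 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2236ms
"""

ELECTION_TIMEOUT_RE = re.compile(r"Election timeout: (\d+)ms")
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")


def pauses_exceeding_timeout(log_text):
    """Return (pause_ms, timeout_ms) pairs where a pause outlasts the election timeout."""
    timeouts = [int(m) for m in ELECTION_TIMEOUT_RE.findall(log_text)]
    pauses = [int(m) for m in PAUSE_RE.findall(log_text)]
    if not timeouts:
        return []
    timeout_ms = timeouts[-1]  # use the most recently logged election timeout
    return [(p, timeout_ms) for p in pauses if p > timeout_ms]


if __name__ == "__main__":
    for pause_ms, timeout_ms in pauses_exceeding_timeout(LOG):
        print(f"pause {pause_ms}ms > election timeout {timeout_ms}ms")
```

Running the same check over a longer OM log would give a rough count of how many pauses are long enough to cost the leader its term at the current setting.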
[~hanishakoneru] thinks we should increase the Ratis leader election timeout
as well.
{noformat}
<property>
  <name>ozone.om.ratis.minimum.timeout</name>
  <value>1s</value>
  <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
  <description>The minimum timeout duration for OM's Ratis server rpc.
  </description>
</property>
<property>
  <name>ozone.om.leader.election.minimum.timeout.duration</name>
  <value>1s</value>
  <tag>OZONE, OM, RATIS, MANAGEMENT</tag>
  <description>The minimum timeout duration for OM ratis leader election.
    Default is 1s.
  </description>
</property>
{noformat}