Sammi Chen created RATIS-2039:
---------------------------------
Summary: Leader lection requests should apply timeout close to
election timeout
Key: RATIS-2039
URL: https://issues.apache.org/jira/browse/RATIS-2039
Project: Ratis
Issue Type: Improvement
Reporter: Sammi Chen
In the following log, election timeout is 5s. om62 election failed when its
vote request timeouted at GRPC level with 3s although it passed the pre-vote
process. After the request timeout, there is no retry, although it's still 2s
left toward election timeout. The ideal state is om62 requests use the same(or
a little bit smaller than) timeout as election timeout, so that in the
following case, the election will finally got response and finish successfully.
{code:java}
2024-03-04 04:14:21,186 WARN
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:21,188 WARN
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
errorCount=796555); keep nextIndex (209384) unchanged and retry.
2024-03-04 04:14:25,169 INFO [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
[email protected] (auth:TOKEN)
2024-03-04 04:14:25,171 INFO [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
Authorization successful for [email protected] (auth:TOKEN) for
protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
2024-03-04 04:14:26,186 WARN
[timer5]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries
(Repeated 8 times in the last 5.000s):
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:26,188 WARN
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
errorCount=796562); keep nextIndex (209384) unchanged and retry. (Repeated 8
times in the last 5.000s)
2024-03-04 04:14:26,193 WARN
[grpc-default-executor-21]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:26,194 WARN
[grpc-default-executor-21]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
errorCount=796563); keep nextIndex (209384) unchanged and retry.
2024-03-04 04:14:26,261 WARN
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om61-GrpcLogAppender: Timed out HEARTBEAT
appendEntries, errorCount=1,
request=AppendEntriesRequest:cid=254709,entriesCount=0
2024-03-04 04:14:26,653 WARN
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
om62@group-009531620E4C-LeaderStateImpl: Lost leadership on term: 19. Election
timeout: 5200ms. In charge for: 415277057ms. Conf: 142632:
peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872,
om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872,
om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
2024-03-04 04:14:26,659 WARN
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
Follower om62@group-009531620E4C->om61(c209383,m209383,n209384, caughtUp=true,
lastRpcSendTime=898, lastRpcResponseTime=5900)
2024-03-04 04:14:26,659 WARN
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
Follower om62@group-009531620E4C->om63(c-1,m-1,n209384, caughtUp=true,
lastRpcSendTime=464, lastRpcResponseTime=415282260)
2024-03-04 04:14:26,661 INFO
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
om62@group-009531620E4C: changes role from LEADER to FOLLOWER at term 19
for StepDownReason:LOST_MAJORITY_HEARTBEATS
2024-03-04 04:14:26,662 INFO
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.RoleInfo:
om62: shutdown om62@group-009531620E4C-LeaderStateImpl
2024-03-04 04:14:26,681 WARN
[om62@group-009531620E4C->om63-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-GrpcLogAppender: Wait interrupted by
java.lang.InterruptedException
2024-03-04 04:14:26,681 WARN
[om62@group-009531620E4C->om61-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om61-GrpcLogAppender: Wait interrupted by
java.lang.InterruptedException
2024-03-04 04:14:26,688 INFO
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.PendingRequests:
om62@group-009531620E4C-PendingRequests: sendNotLeaderResponses
2024-03-04 04:14:26,700 INFO
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
om62@group-009531620E4C: change Leader from om62 to null at term 19 for
StepDownReason:LOST_MAJORITY_HEARTBEATS
2024-03-04 04:14:26,702 INFO
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.RoleInfo:
om62: start om62@group-009531620E4C-FollowerState
2024-03-04 04:14:27,443 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:51482:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:27,510 WARN
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om61-GrpcLogAppender: Timed out HEARTBEAT
appendEntries, errorCount=1,
request=AppendEntriesRequest:cid=254710,entriesCount=0
2024-03-04 04:14:28,656 INFO
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om61-AppendLogResponseHandler: follower responses
appendEntries COMPLETED
2024-03-04 04:14:28,704 INFO
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om61-AppendLogResponseHandler: follower responses
appendEntries COMPLETED
2024-03-04 04:14:30,316 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:60082:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:30,586 ERROR
[qtp1802837169-142]-org.apache.hadoop.ozone.om.OzoneManager: No leader found
2024-03-04 04:14:31,193 WARN
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries
(Repeated 2 times in the last 5.000s):
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:31,194 WARN
[timer7]-org.apache.ratis.grpc.server.GrpcLogAppender:
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
errorCount=796564); keep nextIndex (209384) unchanged and retry. (Repeated 2
times in the last 5.000s)
2024-03-04 04:14:31,902 INFO
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.FollowerState:
om62@group-009531620E4C-FollowerState: Skipping leader election since it
stepped down recently (elapsed = 5199ms < waitTime = 10s)
2024-03-04 04:14:32,541 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:60094:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:34,582 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:60102:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:34,884 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.121.6:51180:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:36,712 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:60112:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:36,898 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.121.6:35118:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:37,061 INFO
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.FollowerState:
om62@group-009531620E4C-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:10360116624ns, electionTimeout:5159ms
2024-03-04 04:14:37,061 INFO
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.RoleInfo:
om62: shutdown om62@group-009531620E4C-FollowerState
2024-03-04 04:14:37,061 INFO
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.RaftServer$Division:
om62@group-009531620E4C: changes role from FOLLOWER to CANDIDATE at term 19
for changeToCandidate
2024-03-04 04:14:37,066 INFO
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.RaftServerConfigKeys:
raft.server.leaderelection.pre-vote = true (default)
2024-03-04 04:14:37,066 INFO
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.RoleInfo:
om62: start om62@group-009531620E4C-LeaderElection8
2024-03-04 04:14:37,068 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 PRE_VOTE round 0: submit vote requests
at term 19 for 142632:
peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872,
om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872,
om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
2024-03-04 04:14:37,091 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 got exception when requesting votes:
java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:38,524 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.8.131:60396:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:38,801 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:60118:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:39,165 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.121.6:35128:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:39,370 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8: PRE_VOTE PASSED received 1
response(s) and 1 exception(s):
2024-03-04 04:14:39,372 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
Response 0: om62<-om61#0:OK-t19
2024-03-04 04:14:39,373 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
Exception 1: java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:39,373 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 PRE_VOTE round 0: result PASSED
2024-03-04 04:14:39,411 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 ELECTION round 0: submit vote requests
at term 20 for 142632:
peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872,
om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872,
om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
2024-03-04 04:14:39,413 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 got exception when requesting votes:
java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:39,551 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.8.131:59726:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,081 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.105.13:59576:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,085 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.8.131:59734:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,376 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.121.6:35140:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,574 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.8.131:59746:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,919 WARN [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
10.140.8.131:59754:null (DIGEST-MD5: IO error acquiring password) with true
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 got exception when requesting votes:
java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED:
deadline exceeded after 2.999711304s.
[remote_addr=ccycloud-2.ozn-hbchf3oz.root.comops.site/10.140.146.6:9872]
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8: ELECTION REJECTED received 0
response(s) and 2 exception(s):
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
Exception 0: java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
Exception 1: java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED:
deadline exceeded after 2.999711304s.
[remote_addr=ccycloud-2.ozn-hbchf3oz.root.comops.site/10.140.146.6:9872]
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
om62@group-009531620E4C-LeaderElection8 ELECTION round 0: result REJECTED
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.RaftServer$Division:
om62@group-009531620E4C: changes role from CANDIDATE to FOLLOWER at term 20
for REJECTED
2024-03-04 04:14:42,414 INFO
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
om62: shutdown om62@group-009531620E4C-LeaderElection8
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)