[
https://issues.apache.org/jira/browse/RATIS-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved RATIS-2039.
-------------------------------
Resolution: Not A Problem
Resolving this as "not a problem".
> Leader lection requests should apply timeout close to election timeout
> ----------------------------------------------------------------------
>
> Key: RATIS-2039
> URL: https://issues.apache.org/jira/browse/RATIS-2039
> Project: Ratis
> Issue Type: Improvement
> Reporter: Sammi Chen
> Priority: Major
>
> In the following log, election timeout is 5s. om62 election failed when its
> vote request timeouted at GRPC level with 3s although it passed the pre-vote
> process. After the request timeout, there is no retry, although it's still
> 2s left toward election timeout. The ideal state is om62 requests use the
> same(or a little bit smaller than) timeout as election timeout, so that in
> the following case, the election will finally got response and finish
> successfully.
> {code:java}
> 2024-03-04 04:14:21,186 WARN
> [grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:21,188 WARN
> [grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
> errorCount=796555); keep nextIndex (209384) unchanged and retry.
> 2024-03-04 04:14:25,169 INFO [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> [email protected] (auth:TOKEN)
> 2024-03-04 04:14:25,171 INFO [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
> Authorization successful for [email protected] (auth:TOKEN) for
> protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
> 2024-03-04 04:14:26,186 WARN
> [timer5]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries
> (Repeated 8 times in the last 5.000s):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:26,188 WARN
> [timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
> errorCount=796562); keep nextIndex (209384) unchanged and retry. (Repeated 8
> times in the last 5.000s)
> 2024-03-04 04:14:26,193 WARN
> [grpc-default-executor-21]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:26,194 WARN
> [grpc-default-executor-21]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
> errorCount=796563); keep nextIndex (209384) unchanged and retry.
> 2024-03-04 04:14:26,261 WARN
> [timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om61-GrpcLogAppender: Timed out HEARTBEAT
> appendEntries, errorCount=1,
> request=AppendEntriesRequest:cid=254709,entriesCount=0
> 2024-03-04 04:14:26,653 WARN
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
> om62@group-009531620E4C-LeaderStateImpl: Lost leadership on term: 19.
> Election timeout: 5200ms. In charge for: 415277057ms. Conf: 142632:
> peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872,
> om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872,
> om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
> 2024-03-04 04:14:26,659 WARN
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
> Follower om62@group-009531620E4C->om61(c209383,m209383,n209384,
> caughtUp=true, lastRpcSendTime=898, lastRpcResponseTime=5900)
> 2024-03-04 04:14:26,659 WARN
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
> Follower om62@group-009531620E4C->om63(c-1,m-1,n209384, caughtUp=true,
> lastRpcSendTime=464, lastRpcResponseTime=415282260)
> 2024-03-04 04:14:26,661 INFO
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
> om62@group-009531620E4C: changes role from LEADER to FOLLOWER at term 19
> for StepDownReason:LOST_MAJORITY_HEARTBEATS
> 2024-03-04 04:14:26,662 INFO
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.RoleInfo:
> om62: shutdown om62@group-009531620E4C-LeaderStateImpl
> 2024-03-04 04:14:26,681 WARN
> [om62@group-009531620E4C->om63-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-GrpcLogAppender: Wait interrupted by
> java.lang.InterruptedException
> 2024-03-04 04:14:26,681 WARN
> [om62@group-009531620E4C->om61-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om61-GrpcLogAppender: Wait interrupted by
> java.lang.InterruptedException
> 2024-03-04 04:14:26,688 INFO
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.PendingRequests:
> om62@group-009531620E4C-PendingRequests: sendNotLeaderResponses
> 2024-03-04 04:14:26,700 INFO
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
> om62@group-009531620E4C: change Leader from om62 to null at term 19 for
> StepDownReason:LOST_MAJORITY_HEARTBEATS
> 2024-03-04 04:14:26,702 INFO
> [om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.RoleInfo:
> om62: start om62@group-009531620E4C-FollowerState
> 2024-03-04 04:14:27,443 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:51482:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:27,510 WARN
> [timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om61-GrpcLogAppender: Timed out HEARTBEAT
> appendEntries, errorCount=1,
> request=AppendEntriesRequest:cid=254710,entriesCount=0
> 2024-03-04 04:14:28,656 INFO
> [grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om61-AppendLogResponseHandler: follower responses
> appendEntries COMPLETED
> 2024-03-04 04:14:28,704 INFO
> [grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om61-AppendLogResponseHandler: follower responses
> appendEntries COMPLETED
> 2024-03-04 04:14:30,316 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:60082:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:30,586 ERROR
> [qtp1802837169-142]-org.apache.hadoop.ozone.om.OzoneManager: No leader found
> 2024-03-04 04:14:31,193 WARN
> [timer6]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries
> (Repeated 2 times in the last 5.000s):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:31,194 WARN
> [timer7]-org.apache.ratis.grpc.server.GrpcLogAppender:
> om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null,
> errorCount=796564); keep nextIndex (209384) unchanged and retry. (Repeated 2
> times in the last 5.000s)
> 2024-03-04 04:14:31,902 INFO
> [om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.FollowerState:
> om62@group-009531620E4C-FollowerState: Skipping leader election since it
> stepped down recently (elapsed = 5199ms < waitTime = 10s)
> 2024-03-04 04:14:32,541 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:60094:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:34,582 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:60102:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:34,884 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.121.6:51180:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:36,712 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:60112:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:36,898 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.121.6:35118:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:37,061 INFO
> [om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.FollowerState:
> om62@group-009531620E4C-FollowerState: change to CANDIDATE,
> lastRpcElapsedTime:10360116624ns, electionTimeout:5159ms
> 2024-03-04 04:14:37,061 INFO
> [om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.RoleInfo:
> om62: shutdown om62@group-009531620E4C-FollowerState
> 2024-03-04 04:14:37,061 INFO
> [om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.RaftServer$Division:
> om62@group-009531620E4C: changes role from FOLLOWER to CANDIDATE at term 19
> for changeToCandidate
> 2024-03-04 04:14:37,066 INFO
> [om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.RaftServerConfigKeys:
> raft.server.leaderelection.pre-vote = true (default)
> 2024-03-04 04:14:37,066 INFO
> [om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.RoleInfo:
> om62: start om62@group-009531620E4C-LeaderElection8
> 2024-03-04 04:14:37,068 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 PRE_VOTE round 0: submit vote
> requests at term 19 for 142632:
> peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872,
> om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872,
> om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
> 2024-03-04 04:14:37,091 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 got exception when requesting votes:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:38,524 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.8.131:60396:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:38,801 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:60118:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:39,165 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.121.6:35128:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:39,370 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8: PRE_VOTE PASSED received 1
> response(s) and 1 exception(s):
> 2024-03-04 04:14:39,372 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> Response 0: om62<-om61#0:OK-t19
> 2024-03-04 04:14:39,373 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> Exception 1: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:39,373 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 PRE_VOTE round 0: result PASSED
> 2024-03-04 04:14:39,411 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 ELECTION round 0: submit vote
> requests at term 20 for 142632:
> peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872,
> om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872,
> om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
> 2024-03-04 04:14:39,413 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 got exception when requesting votes:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:39,551 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.8.131:59726:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:41,081 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.105.13:59576:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:41,085 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.8.131:59734:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:41,376 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.121.6:35140:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:41,574 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.8.131:59746:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:41,919 WARN [Socket Reader #1 for port
> 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for
> 10.140.8.131:59754:null (DIGEST-MD5: IO error acquiring password) with true
> cause: (OM:om62 is not the leader. Could not determine the leader node.)
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 got exception when requesting votes:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> DEADLINE_EXCEEDED: deadline exceeded after 2.999711304s.
> [remote_addr=ccycloud-2.ozn-hbchf3oz.root.comops.site/10.140.146.6:9872]
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8: ELECTION REJECTED received 0
> response(s) and 2 exception(s):
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> Exception 0: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> Exception 1: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> DEADLINE_EXCEEDED: deadline exceeded after 2.999711304s.
> [remote_addr=ccycloud-2.ozn-hbchf3oz.root.comops.site/10.140.146.6:9872]
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
> om62@group-009531620E4C-LeaderElection8 ELECTION round 0: result REJECTED
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.RaftServer$Division:
> om62@group-009531620E4C: changes role from CANDIDATE to FOLLOWER at term 20
> for REJECTED
> 2024-03-04 04:14:42,414 INFO
> [om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
> om62: shutdown om62@group-009531620E4C-LeaderElection8
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)