Sammi Chen created RATIS-2039:
---------------------------------

             Summary: Leader lection requests should apply timeout close to 
election timeout
                 Key: RATIS-2039
                 URL: https://issues.apache.org/jira/browse/RATIS-2039
             Project: Ratis
          Issue Type: Improvement
            Reporter: Sammi Chen


In the following log,  election timeout is 5s.  om62 election failed when its 
vote request timeouted at GRPC level with 3s although it passed the pre-vote 
process.  After the request timeout, there is no retry, although it's still 2s 
left toward election timeout. The ideal state is om62 requests use the same(or 
a little bit smaller than) timeout as election timeout, so that in the 
following case, the election will finally got response and finish successfully. 
{code:java}
2024-03-04 04:14:21,186 WARN 
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:21,188 WARN 
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null, 
errorCount=796555); keep nextIndex (209384) unchanged and retry.
2024-03-04 04:14:25,169 INFO [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
[email protected] (auth:TOKEN)
2024-03-04 04:14:25,171 INFO [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for [email protected] (auth:TOKEN) for 
protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
2024-03-04 04:14:26,186 WARN 
[timer5]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries 
(Repeated 8 times in the last 5.000s): 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:26,188 WARN 
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null, 
errorCount=796562); keep nextIndex (209384) unchanged and retry. (Repeated 8 
times in the last 5.000s)
2024-03-04 04:14:26,193 WARN 
[grpc-default-executor-21]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:26,194 WARN 
[grpc-default-executor-21]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null, 
errorCount=796563); keep nextIndex (209384) unchanged and retry.
2024-03-04 04:14:26,261 WARN 
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om61-GrpcLogAppender: Timed out HEARTBEAT 
appendEntries, errorCount=1, 
request=AppendEntriesRequest:cid=254709,entriesCount=0
2024-03-04 04:14:26,653 WARN 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
 om62@group-009531620E4C-LeaderStateImpl: Lost leadership on term: 19. Election 
timeout: 5200ms. In charge for: 415277057ms. Conf: 142632: 
peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872, 
om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872, 
om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
2024-03-04 04:14:26,659 WARN 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
 Follower om62@group-009531620E4C->om61(c209383,m209383,n209384, caughtUp=true, 
lastRpcSendTime=898, lastRpcResponseTime=5900)
2024-03-04 04:14:26,659 WARN 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
 Follower om62@group-009531620E4C->om63(c-1,m-1,n209384, caughtUp=true, 
lastRpcSendTime=464, lastRpcResponseTime=415282260)
2024-03-04 04:14:26,661 INFO 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
 om62@group-009531620E4C: changes role from    LEADER to FOLLOWER at term 19 
for StepDownReason:LOST_MAJORITY_HEARTBEATS
2024-03-04 04:14:26,662 INFO 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.RoleInfo:
 om62: shutdown om62@group-009531620E4C-LeaderStateImpl
2024-03-04 04:14:26,681 WARN 
[om62@group-009531620E4C->om63-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
 om62@group-009531620E4C->om63-GrpcLogAppender: Wait interrupted by 
java.lang.InterruptedException
2024-03-04 04:14:26,681 WARN 
[om62@group-009531620E4C->om61-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
 om62@group-009531620E4C->om61-GrpcLogAppender: Wait interrupted by 
java.lang.InterruptedException
2024-03-04 04:14:26,688 INFO 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.PendingRequests:
 om62@group-009531620E4C-PendingRequests: sendNotLeaderResponses
2024-03-04 04:14:26,700 INFO 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.RaftServer$Division:
 om62@group-009531620E4C: change Leader from om62 to null at term 19 for 
StepDownReason:LOST_MAJORITY_HEARTBEATS
2024-03-04 04:14:26,702 INFO 
[om62@group-009531620E4C-LeaderStateImpl]-org.apache.ratis.server.impl.RoleInfo:
 om62: start om62@group-009531620E4C-FollowerState
2024-03-04 04:14:27,443 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:51482:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:27,510 WARN 
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om61-GrpcLogAppender: Timed out HEARTBEAT 
appendEntries, errorCount=1, 
request=AppendEntriesRequest:cid=254710,entriesCount=0
2024-03-04 04:14:28,656 INFO 
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om61-AppendLogResponseHandler: follower responses 
appendEntries COMPLETED
2024-03-04 04:14:28,704 INFO 
[grpc-default-executor-30]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om61-AppendLogResponseHandler: follower responses 
appendEntries COMPLETED
2024-03-04 04:14:30,316 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:60082:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:30,586 ERROR 
[qtp1802837169-142]-org.apache.hadoop.ozone.om.OzoneManager: No leader found
2024-03-04 04:14:31,193 WARN 
[timer6]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-AppendLogResponseHandler: Failed appendEntries 
(Repeated 2 times in the last 5.000s): 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:31,194 WARN 
[timer7]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om62@group-009531620E4C->om63-GrpcLogAppender: Follower failed (request=null, 
errorCount=796564); keep nextIndex (209384) unchanged and retry. (Repeated 2 
times in the last 5.000s)
2024-03-04 04:14:31,902 INFO 
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.FollowerState:
 om62@group-009531620E4C-FollowerState: Skipping leader election since it 
stepped down recently (elapsed = 5199ms < waitTime = 10s)
2024-03-04 04:14:32,541 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:60094:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:34,582 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:60102:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:34,884 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.121.6:51180:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:36,712 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:60112:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:36,898 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.121.6:35118:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:37,061 INFO 
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.FollowerState:
 om62@group-009531620E4C-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:10360116624ns, electionTimeout:5159ms
2024-03-04 04:14:37,061 INFO 
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.RoleInfo: 
om62: shutdown om62@group-009531620E4C-FollowerState
2024-03-04 04:14:37,061 INFO 
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.RaftServer$Division:
 om62@group-009531620E4C: changes role from  FOLLOWER to CANDIDATE at term 19 
for changeToCandidate
2024-03-04 04:14:37,066 INFO 
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.RaftServerConfigKeys:
 raft.server.leaderelection.pre-vote = true (default)
2024-03-04 04:14:37,066 INFO 
[om62@group-009531620E4C-FollowerState]-org.apache.ratis.server.impl.RoleInfo: 
om62: start om62@group-009531620E4C-LeaderElection8
2024-03-04 04:14:37,068 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 PRE_VOTE round 0: submit vote requests 
at term 19 for 142632: 
peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872, 
om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872, 
om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
2024-03-04 04:14:37,091 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 got exception when requesting votes: 
java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:38,524 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.8.131:60396:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:38,801 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:60118:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:39,165 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.121.6:35128:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:39,370 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8: PRE_VOTE PASSED received 1 
response(s) and 1 exception(s):
2024-03-04 04:14:39,372 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
   Response 0: om62<-om61#0:OK-t19
2024-03-04 04:14:39,373 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
   Exception 1: java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:39,373 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 PRE_VOTE round 0: result PASSED
2024-03-04 04:14:39,411 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 ELECTION round 0: submit vote requests 
at term 20 for 142632: 
peers:[om61|ccycloud-2.ozn-hbchf3oz.root.comops.site:9872, 
om62|ccycloud-6.ozn-hbchf3oz.root.comops.site:9872, 
om63|ccycloud-4.ozn-hbchf3oz.root.comops.site:9872]|listeners:[], old=null
2024-03-04 04:14:39,413 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 got exception when requesting votes: 
java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:39,551 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.8.131:59726:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,081 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.105.13:59576:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,085 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.8.131:59734:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,376 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.121.6:35140:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,574 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.8.131:59746:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:41,919 WARN [Socket Reader #1 for port 
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 
10.140.8.131:59754:null (DIGEST-MD5: IO error acquiring password) with true 
cause: (OM:om62 is not the leader. Could not determine the leader node.)
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 got exception when requesting votes: 
java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: 
deadline exceeded after 2.999711304s. 
[remote_addr=ccycloud-2.ozn-hbchf3oz.root.comops.site/10.140.146.6:9872]
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8: ELECTION REJECTED received 0 
response(s) and 2 exception(s):
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
   Exception 0: java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
   Exception 1: java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: 
deadline exceeded after 2.999711304s. 
[remote_addr=ccycloud-2.ozn-hbchf3oz.root.comops.site/10.140.146.6:9872]
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.LeaderElection:
 om62@group-009531620E4C-LeaderElection8 ELECTION round 0: result REJECTED
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.RaftServer$Division:
 om62@group-009531620E4C: changes role from CANDIDATE to FOLLOWER at term 20 
for REJECTED
2024-03-04 04:14:42,414 INFO 
[om62@group-009531620E4C-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
 om62: shutdown om62@group-009531620E4C-LeaderElection8
 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to