[ 
https://issues.apache.org/jira/browse/RATIS-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719611#comment-17719611
 ] 

Yaolong Liu commented on RATIS-1803:
------------------------------------

[~szetszwo] Yeah, I haven't fixed this yet :(    But I'm not sure whether this 
problem is caused by Ratis. It may be something else, and Ratis is an innocent 
victim.
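
One way to separate the two possibilities is to check name resolution from 
inside the affected pod without going through Ratis or gRPC at all. The 
following is only a rough sketch using the plain JDK resolver: the class name 
PeerDnsCheck and the Lookup hook are made up for illustration, and the host 
names are the ones from the peer list in the logs. Note that the JDK resolver 
and the resolver used by the shaded gRPC library may cache results 
differently, so a success here does not by itself clear Ratis.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PeerDnsCheck {
  /** Hypothetical lookup hook, so the check can also be exercised without real DNS. */
  public interface Lookup {
    InetAddress resolve(String host) throws UnknownHostException;
  }

  /** Returns the hosts that the given lookup could not resolve. */
  public static List<String> unresolved(List<String> hosts, Lookup lookup) {
    List<String> failed = new ArrayList<>();
    for (String host : hosts) {
      try {
        lookup.resolve(host);
      } catch (UnknownHostException e) {
        failed.add(host);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    // Host names taken from the peer list in the logs of this issue.
    List<String> peers =
        Arrays.asList("alluxio-master-0", "alluxio-master-1", "alluxio-master-2");
    // InetAddress.getByName uses the JDK resolver, independent of Ratis and gRPC.
    System.out.println("unresolved: " + unresolved(peers, InetAddress::getByName));
  }
}
```

If this also reports alluxio-master-0 as unresolved while the cluster is in 
the bad state, the cause is more likely in the cluster DNS than in Ratis.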

> GrpcLogAppender can't resolve host in kubernetes cluster
> --------------------------------------------------------
>
>                 Key: RATIS-1803
>                 URL: https://issues.apache.org/jira/browse/RATIS-1803
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Yaolong Liu
>            Priority: Major
>
> In a k8s container environment, the candidate cannot resolve the host of one 
> of the followers during the election. After the election succeeds, the 
> leader still cannot resolve the host of that follower, so it fails to send 
> heartbeats. Followers keep initiating pre-votes, but they are always 
> rejected. After half an hour, the cluster returns to normal.
> The leader log:
> {code:java}
> 2023-03-02 10:26:11,909 INFO  RoleInfo - alluxio-master-1_19200: start 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1
> 2023-03-02 10:26:11,910 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 PRE_VOTE round 0: 
> submit vote requests at term 0 for -1: 
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 2023-03-02 10:26:11,914 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-02 10:26:11,914 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-02 10:26:11,915 INFO  GrpcServerProtocolClient - Build channel for 
> alluxio-master-0_19200
> 2023-03-02 10:26:11,915 INFO  GrpcServerProtocolClient - Build channel for 
> alluxio-master-2_19200
> 2023-03-02 10:26:11,920 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 got exception when 
> requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,945 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1: PRE_VOTE PASSED 
> received 1 response(s) and 1 exception(s):
> 2023-03-02 10:26:11,945 INFO  LeaderElection -   Response 0: 
> alluxio-master-1_19200<-alluxio-master-2_19200#0:OK-t0
> 2023-03-02 10:26:11,945 INFO  LeaderElection -   Exception 1: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,945 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 PRE_VOTE round 0: 
> result PASSED
> 2023-03-02 10:26:11,948 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 ELECTION round 0: 
> submit vote requests at term 1 for -1: 
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 2023-03-02 10:26:11,948 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-02 10:26:11,948 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-02 10:26:11,948 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 got exception when 
> requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,961 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1: ELECTION PASSED 
> received 1 response(s) and 1 exception(s):
> 2023-03-02 10:26:11,961 INFO  LeaderElection -   Response 0: 
> alluxio-master-1_19200<-alluxio-master-2_19200#0:OK-t1
> 2023-03-02 10:26:11,961 INFO  LeaderElection -   Exception 1: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,961 INFO  LeaderElection - 
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 ELECTION round 0: 
> result PASSED
> ....
> 2023-03-02 10:27:29,535 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-GrpcLogAppender:
>  Leader has not got in touch with Follower 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200(c-1,m0,n9, 
> attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=97556) yet, just keep 
> nextIndex unchanged and retry.
> 2023-03-02 10:27:32,035 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:27:32,035 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:27:32,035 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-GrpcLogAppender:
>  Leader has not got in touch with Follower 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200(c-1,m0,n9, 
> attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=100056) yet, just 
> keep nextIndex unchanged and retry.
> 2023-03-02 10:27:32,035 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-GrpcLogAppender:
>  Leader has not got in touch with Follower 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200(c-1,m0,n9, 
> attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=100057) yet, just 
> keep nextIndex unchanged and retry.
> 2023-03-02 10:27:33,444 INFO  VoteContext - 
> alluxio-master-1_19200@group-ABB3109A44C1-LEADER: reject PRE_VOTE from 
> alluxio-master-0_19200: this server is the leader and still has leadership
> 2023-03-02 10:27:34,535 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:27:34,535 WARN  GrpcLogAppender - 
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
> Unable to resolve host alluxio-master-0
> {code}
> The follower log:
> {code:java}
> 2023-03-02 10:27:21,985 INFO  RaftServerConfigKeys - 
> raft.server.leaderelection.pre-vote = true (default)
> 2023-03-02 10:27:21,985 INFO  RoleInfo - alluxio-master-0_19200: start 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6
> 2023-03-02 10:27:21,986 INFO  LeaderElection - 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6 PRE_VOTE round 0: 
> submit vote requests at term 1 for -1: 
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 2023-03-02 10:27:21,987 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-02 10:27:21,987 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-02 10:27:22,019 INFO  LeaderElection - 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6: PRE_VOTE REJECTED 
> received 2 response(s) and 0 exception(s):
> 2023-03-02 10:27:22,019 INFO  LeaderElection -   Response 0: 
> alluxio-master-0_19200<-alluxio-master-2_19200#0:FAIL-t1
> 2023-03-02 10:27:22,019 INFO  LeaderElection -   Response 1: 
> alluxio-master-0_19200<-alluxio-master-1_19200#0:FAIL-t1
> 2023-03-02 10:27:22,019 INFO  LeaderElection - 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6 PRE_VOTE round 0: 
> result REJECTED
> 2023-03-02 10:27:22,019 INFO  RoleInfo - alluxio-master-0_19200: shutdown 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6
> 2023-03-02 10:27:22,020 INFO  RoleInfo - alluxio-master-0_19200: start 
> alluxio-master-0_19200@group-ABB3109A44C1-FollowerState
> 2023-03-02 10:27:22,021 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-02 10:27:22,021 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-02 10:27:33,412 INFO  FollowerState - 
> alluxio-master-0_19200@group-ABB3109A44C1-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:11392399572ns, electionTimeout:11391ms
> 2023-03-02 10:27:33,412 INFO  RoleInfo - alluxio-master-0_19200: shutdown 
> alluxio-master-0_19200@group-ABB3109A44C1-FollowerState
> 2023-03-02 10:27:33,413 INFO  RaftServerConfigKeys - 
> raft.server.leaderelection.pre-vote = true (default)
> 2023-03-02 10:27:33,413 INFO  RoleInfo - alluxio-master-0_19200: start 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection7
> 2023-03-02 10:27:33,414 INFO  LeaderElection - 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection7 PRE_VOTE round 0: 
> submit vote requests at term 1 for -1: 
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>  
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 2023-03-02 10:27:33,414 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to 
> raft.server.rpc.timeout.min)
> 2023-03-02 10:27:33,415 INFO  RaftServerConfigKeys - 
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to 
> raft.server.rpc.timeout.max)
> 2023-03-02 10:27:33,446 INFO  LeaderElection - 
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection7: PRE_VOTE REJECTED 
> received 2 response(s) and 0 exception(s):
> {code}
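
For readers wondering why the election in the leader log still passes with 
"1 response(s) and 1 exception(s)": in a three-peer group, the candidate's 
own vote plus one granted vote already forms a majority, so the unresolvable 
peer does not block the election. The sketch below only illustrates that 
arithmetic. It is not the Ratis implementation, and the class and method 
names are invented for this example.

```java
public class QuorumSketch {
  /** True if the granted votes, plus the candidate's own vote, form a strict majority. */
  public static boolean hasMajority(int clusterSize, int grantedByOthers) {
    int votes = grantedByOthers + 1; // the candidate counts its own vote
    return votes > clusterSize / 2;
  }

  public static void main(String[] args) {
    // Leader log: 3 peers, one OK response, one exception.
    System.out.println(hasMajority(3, 1));
    // Follower log: 3 peers, two FAIL responses, so no vote granted by others.
    System.out.println(hasMajority(3, 0));
  }
}
```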



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
