[
https://issues.apache.org/jira/browse/RATIS-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695987#comment-17695987
]
Tsz-wo Sze commented on RATIS-1803:
-----------------------------------
[~liuyaolong], could this be related to a DNS timeout? I guess you have
already checked this:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
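
To check resolution from inside the affected pod, something like the following could be run alongside the server. It is only a minimal sketch: the class name DnsProbe, the probe() helper, and the round count and sleep interval are made up for illustration, and the default host names are the ones from the peer list in the log below. It uses the JDK's InetAddress lookup, which may not follow exactly the same path as the resolver inside the gRPC channel, and the JVM caches lookup results (see the networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties), so treat the output as a hint rather than proof.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

/** Hypothetical helper: times JDK host lookups for the Raft peers. */
final class DnsProbe {
  /** Outcome of one lookup attempt. */
  static final class Result {
    final String host;
    final boolean resolved;
    final long elapsedMillis;
    final String detail;

    Result(String host, boolean resolved, long elapsedMillis, String detail) {
      this.host = host;
      this.resolved = resolved;
      this.elapsedMillis = elapsedMillis;
      this.detail = detail;
    }

    @Override
    public String toString() {
      return host + ": " + (resolved ? "OK " : "FAILED ") + detail
          + " (" + elapsedMillis + "ms)";
    }
  }

  /** Resolve one host name and record how long the lookup took. */
  static Result probe(String host) {
    final long start = System.nanoTime();
    try {
      final InetAddress[] addresses = InetAddress.getAllByName(host);
      final long elapsed = (System.nanoTime() - start) / 1_000_000;
      return new Result(host, true, elapsed, Arrays.toString(addresses));
    } catch (UnknownHostException e) {
      final long elapsed = (System.nanoTime() - start) / 1_000_000;
      return new Result(host, false, elapsed, String.valueOf(e.getMessage()));
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Defaults are the peer host names from the reported log; pass others as arguments.
    final String[] hosts = args.length > 0 ? args
        : new String[] {"alluxio-master-0", "alluxio-master-1", "alluxio-master-2"};
    // Probe a few rounds so that slow or intermittent lookups show up.
    for (int round = 0; round < 5; round++) {
      for (String host : hosts) {
        System.out.println(probe(host));
      }
      Thread.sleep(2000);
    }
  }
}
```

If alluxio-master-0 fails or is slow here while the other two resolve quickly, that would point at the cluster DNS rather than at Ratis.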
> GrpcLogAppender can't resolve host in kubernetes cluster
> --------------------------------------------------------
>
> Key: RATIS-1803
> URL: https://issues.apache.org/jira/browse/RATIS-1803
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Yaolong Liu
> Priority: Major
>
> In a k8s container environment, the candidate cannot resolve the host of one
> of the followers during the election process. After the election succeeds,
> the leader still cannot resolve that follower's host, so it fails to send
> heartbeats. The follower keeps initiating pre-votes, but they are always
> rejected. After half an hour, the cluster returns to normal.
> The leader log:
> {code:java}
> 2023-03-02 10:26:11,909 INFO RoleInfo - alluxio-master-1_19200: start
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1
> 2023-03-02 10:26:11,910 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 PRE_VOTE round 0:
> submit vote requests at term 0 for -1:
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
> old=null
> 2023-03-02 10:26:11,914 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to
> raft.server.rpc.timeout.min)
> 2023-03-02 10:26:11,914 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to
> raft.server.rpc.timeout.max)
> 2023-03-02 10:26:11,915 INFO GrpcServerProtocolClient - Build channel for
> alluxio-master-0_19200
> 2023-03-02 10:26:11,915 INFO GrpcServerProtocolClient - Build channel for
> alluxio-master-2_19200
> 2023-03-02 10:26:11,920 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 got exception when
> requesting votes: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,945 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1: PRE_VOTE PASSED
> received 1 response(s) and 1 exception(s):
> 2023-03-02 10:26:11,945 INFO LeaderElection - Response 0:
> alluxio-master-1_19200<-alluxio-master-2_19200#0:OK-t0
> 2023-03-02 10:26:11,945 INFO LeaderElection - Exception 1:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,945 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 PRE_VOTE round 0:
> result PASSED
> 2023-03-02 10:26:11,948 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 ELECTION round 0:
> submit vote requests at term 1 for -1:
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
> old=null
> 2023-03-02 10:26:11,948 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to
> raft.server.rpc.timeout.min)
> 2023-03-02 10:26:11,948 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to
> raft.server.rpc.timeout.max)
> 2023-03-02 10:26:11,948 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 got exception when
> requesting votes: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,961 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1: ELECTION PASSED
> received 1 response(s) and 1 exception(s):
> 2023-03-02 10:26:11,961 INFO LeaderElection - Response 0:
> alluxio-master-1_19200<-alluxio-master-2_19200#0:OK-t1
> 2023-03-02 10:26:11,961 INFO LeaderElection - Exception 1:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:26:11,961 INFO LeaderElection -
> alluxio-master-1_19200@group-ABB3109A44C1-LeaderElection1 ELECTION round 0:
> result PASSED
> ....
> 2023-03-02 10:27:29,535 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-GrpcLogAppender:
> Leader has not got in touch with Follower
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200(c-1,m0,n9,
> attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=97556) yet, just keep
> nextIndex unchanged and retry.
> 2023-03-02 10:27:32,035 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:27:32,035 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:27:32,035 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-GrpcLogAppender:
> Leader has not got in touch with Follower
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200(c-1,m0,n9,
> attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=100056) yet, just
> keep nextIndex unchanged and retry.
> 2023-03-02 10:27:32,035 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-GrpcLogAppender:
> Leader has not got in touch with Follower
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200(c-1,m0,n9,
> attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=100057) yet, just
> keep nextIndex unchanged and retry.
> 2023-03-02 10:27:33,444 INFO VoteContext -
> alluxio-master-1_19200@group-ABB3109A44C1-LEADER: reject PRE_VOTE from
> alluxio-master-0_19200: this server is the leader and still has leadership
> 2023-03-02 10:27:34,535 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> 2023-03-02 10:27:34,535 WARN GrpcLogAppender -
> alluxio-master-1_19200@group-ABB3109A44C1->alluxio-master-0_19200-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
> Unable to resolve host alluxio-master-0
> {code}
> The follower log:
> {code:java}
> 2023-03-02 10:27:21,985 INFO RaftServerConfigKeys -
> raft.server.leaderelection.pre-vote = true (default)
> 2023-03-02 10:27:21,985 INFO RoleInfo - alluxio-master-0_19200: start
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6
> 2023-03-02 10:27:21,986 INFO LeaderElection -
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6 PRE_VOTE round 0:
> submit vote requests at term 1 for -1:
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
> old=null
> 2023-03-02 10:27:21,987 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to
> raft.server.rpc.timeout.min)
> 2023-03-02 10:27:21,987 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to
> raft.server.rpc.timeout.max)
> 2023-03-02 10:27:22,019 INFO LeaderElection -
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6: PRE_VOTE REJECTED
> received 2 response(s) and 0 exception(s):
> 2023-03-02 10:27:22,019 INFO LeaderElection - Response 0:
> alluxio-master-0_19200<-alluxio-master-2_19200#0:FAIL-t1
> 2023-03-02 10:27:22,019 INFO LeaderElection - Response 1:
> alluxio-master-0_19200<-alluxio-master-1_19200#0:FAIL-t1
> 2023-03-02 10:27:22,019 INFO LeaderElection -
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6 PRE_VOTE round 0:
> result REJECTED
> 2023-03-02 10:27:22,019 INFO RoleInfo - alluxio-master-0_19200: shutdown
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection6
> 2023-03-02 10:27:22,020 INFO RoleInfo - alluxio-master-0_19200: start
> alluxio-master-0_19200@group-ABB3109A44C1-FollowerState
> 2023-03-02 10:27:22,021 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to
> raft.server.rpc.timeout.min)
> 2023-03-02 10:27:22,021 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to
> raft.server.rpc.timeout.max)
> 2023-03-02 10:27:33,412 INFO FollowerState -
> alluxio-master-0_19200@group-ABB3109A44C1-FollowerState: change to CANDIDATE,
> lastRpcElapsedTime:11392399572ns, electionTimeout:11391ms
> 2023-03-02 10:27:33,412 INFO RoleInfo - alluxio-master-0_19200: shutdown
> alluxio-master-0_19200@group-ABB3109A44C1-FollowerState
> 2023-03-02 10:27:33,413 INFO RaftServerConfigKeys -
> raft.server.leaderelection.pre-vote = true (default)
> 2023-03-02 10:27:33,413 INFO RoleInfo - alluxio-master-0_19200: start
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection7
> 2023-03-02 10:27:33,414 INFO LeaderElection -
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection7 PRE_VOTE round 0:
> submit vote requests at term 1 for -1:
> peers:[alluxio-master-0_19200|rpc:alluxio-master-0:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-2_19200|rpc:alluxio-master-2:19200|priority:0|startupRole:FOLLOWER,
>
> alluxio-master-1_19200|rpc:alluxio-master-1:19200|priority:0|startupRole:FOLLOWER]|listeners:[],
> old=null
> 2023-03-02 10:27:33,414 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.min = 10000ms (fallback to
> raft.server.rpc.timeout.min)
> 2023-03-02 10:27:33,415 INFO RaftServerConfigKeys -
> raft.server.rpc.first-election.timeout.max = 20000ms (fallback to
> raft.server.rpc.timeout.max)
> 2023-03-02 10:27:33,446 INFO LeaderElection -
> alluxio-master-0_19200@group-ABB3109A44C1-LeaderElection7: PRE_VOTE REJECTED
> received 2 response(s) and 0 exception(s):
> {code}