Adar Dembo created KUDU-3092:
--------------------------------
Summary: MultiMasterIdleConnectionsITest.ClientReacquiresAuthnToken is flaky
Key: KUDU-3092
URL: https://issues.apache.org/jira/browse/KUDU-3092
Project: Kudu
Issue Type: Bug
Components: test
Affects Versions: 1.12.0
Reporter: Adar Dembo
Attachments: auth_token_expire-itest.txt.gz
There's code in this test to force leadership to transfer from one master to
another:
{noformat}
  int former_leader_master_idx;
  ASSERT_OK(cluster_->GetLeaderMasterIndex(&former_leader_master_idx));
  const int leader_idx = (former_leader_master_idx + 1) % num_masters_;
  ASSERT_EVENTUALLY([&] {
    consensus::ConsensusServiceProxy proxy(
        cluster_->messenger(), cluster_->master(leader_idx)->bound_rpc_addr(),
        cluster_->master(leader_idx)->bound_rpc_hostport().host());
    consensus::RunLeaderElectionRequestPB req;
    req.set_tablet_id(master::SysCatalogTable::kSysCatalogTabletId);
    req.set_dest_uuid(cluster_->master(leader_idx)->uuid());
    rpc::RpcController rpc;
    rpc.set_timeout(MonoDelta::FromSeconds(1));
    consensus::RunLeaderElectionResponsePB resp;
    ASSERT_OK(proxy.RunLeaderElection(req, &resp, &rpc));
    int idx;
    ASSERT_OK(cluster_->GetLeaderMasterIndex(&idx));
    ASSERT_NE(former_leader_master_idx, idx);
  });
{noformat}
Unfortunately, I think that code is flaky: leadership could naturally transfer
from former_leader_master_idx to leader_idx after the call to
GetLeaderMasterIndex but before the call to RunLeaderElection. If that happens,
the forced election is a no-op on every retry, the leader master index never
changes, and eventually the ASSERT_EVENTUALLY times out.
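To illustrate the no-op, here is a minimal sketch using a hypothetical in-memory FakeCluster (not Kudu's ExternalMiniCluster API; every name below is an assumption made for illustration). It models a forced election that does nothing when the target already leads, and shows one possible way to avoid sending the request to the current leader: re-read the leader immediately before choosing the target. This is not necessarily how the test should be fixed.

```cpp
#include <cassert>

// Hypothetical stand-in for a multi-master cluster; not Kudu's real API.
struct FakeCluster {
  int num_masters;
  int leader_idx;

  int GetLeaderMasterIndex() const { return leader_idx; }

  // Models a forced election: returns false (a no-op) when the target
  // already leads, mirroring the "already a leader" message in the log.
  bool RunLeaderElection(int target_idx) {
    if (target_idx == leader_idx) return false;
    leader_idx = target_idx;
    return true;
  }
};

// Chooses the election target from the leader observed at call time, so the
// request is never sent to the node that already leads.
bool ForceLeaderChange(FakeCluster* cluster) {
  assert(cluster->num_masters > 1);  // a change needs at least two masters
  const int current = cluster->GetLeaderMasterIndex();
  const int target = (current + 1) % cluster->num_masters;
  return cluster->RunLeaderElection(target) &&
         cluster->GetLeaderMasterIndex() != current;
}
```

In this model, if leadership has already moved to the precomputed target, RunLeaderElection on that target returns false every time, while ForceLeaderChange still produces a change because it picks the target relative to whichever master currently leads.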
I have a failed test run that corroborates this:
{noformat}
...
<repeats a bunch of times>
I0324 01:57:59.504002 25803 tablet_service.cc:1467] Received Run Leader Election RPC: tablet_id: "00000000000000000000000000000000"
dest_uuid: "73bc7d22b7044c94a045e79cb2f31c57"
from {username='test-admin', principal='[email protected]'} at 127.0.0.1:37048
I0324 01:57:59.504032 25803 raft_consensus.cc:462] T 00000000000000000000000000000000 P 73bc7d22b7044c94a045e79cb2f31c57 [term 1 LEADER]: Not starting forced leader election -- already a leader
I0324 01:58:00.504957 25803 tablet_service.cc:1467] Received Run Leader Election RPC: tablet_id: "00000000000000000000000000000000"
dest_uuid: "73bc7d22b7044c94a045e79cb2f31c57"
from {username='test-admin', principal='[email protected]'} at 127.0.0.1:37048
I0324 01:58:00.504988 25803 raft_consensus.cc:462] T 00000000000000000000000000000000 P 73bc7d22b7044c94a045e79cb2f31c57 [term 1 LEADER]: Not starting forced leader election -- already a leader
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/integration-tests/auth_token_expire-itest.cc:588: Failure
Expected: (former_leader_master_idx) != (idx), actual: 0 vs 0
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/util/test_util.cc:348: Failure
Failed
Timed out waiting for assertion to pass.
{noformat}
Attaching the full test log.