Adar Dembo created KUDU-3092:
--------------------------------

             Summary: 
MultiMasterIdleConnectionsITest.ClientReacquiresAuthnToken is flaky
                 Key: KUDU-3092
                 URL: https://issues.apache.org/jira/browse/KUDU-3092
             Project: Kudu
          Issue Type: Bug
          Components: test
    Affects Versions: 1.12.0
            Reporter: Adar Dembo
         Attachments: auth_token_expire-itest.txt.gz

There's code in this test to force leadership to transfer from one master to 
another:

{noformat}
    int former_leader_master_idx;
    ASSERT_OK(cluster_->GetLeaderMasterIndex(&former_leader_master_idx));
    const int leader_idx = (former_leader_master_idx + 1) % num_masters_;
    ASSERT_EVENTUALLY([&] {
      consensus::ConsensusServiceProxy proxy(
          cluster_->messenger(), cluster_->master(leader_idx)->bound_rpc_addr(),
          cluster_->master(leader_idx)->bound_rpc_hostport().host());
      consensus::RunLeaderElectionRequestPB req;
      req.set_tablet_id(master::SysCatalogTable::kSysCatalogTabletId);
      req.set_dest_uuid(cluster_->master(leader_idx)->uuid());
      rpc::RpcController rpc;
      rpc.set_timeout(MonoDelta::FromSeconds(1));
      consensus::RunLeaderElectionResponsePB resp;
      ASSERT_OK(proxy.RunLeaderElection(req, &resp, &rpc));
      int idx;
      ASSERT_OK(cluster_->GetLeaderMasterIndex(&idx));
      ASSERT_NE(former_leader_master_idx, idx);
    });
{noformat}

Unfortunately, I think that code is flaky: leadership could naturally transfer 
from former_leader_master_idx to leader_idx after the call to 
GetLeaderMasterIndex but before the call to RunLeaderElection. If that happens, 
the election sequence will no-op, the leader master index never changes, and 
eventually, the ASSERT_EVENTUALLY fails.

I have a failed test run that corroborates this:
{noformat}
...
<repeats a bunch of times>
I0324 01:57:59.504002 25803 tablet_service.cc:1467] Received Run Leader 
Election RPC: tablet_id: "00000000000000000000000000000000"
dest_uuid: "73bc7d22b7044c94a045e79cb2f31c57"
 from {username='test-admin', principal='[email protected]'} at 
127.0.0.1:37048
I0324 01:57:59.504032 25803 raft_consensus.cc:462] T 
00000000000000000000000000000000 P 73bc7d22b7044c94a045e79cb2f31c57 [term 1 
LEADER]: Not starting forced leader election -- already a leader
I0324 01:58:00.504957 25803 tablet_service.cc:1467] Received Run Leader 
Election RPC: tablet_id: "00000000000000000000000000000000"
dest_uuid: "73bc7d22b7044c94a045e79cb2f31c57"
 from {username='test-admin', principal='[email protected]'} at 
127.0.0.1:37048
I0324 01:58:00.504988 25803 raft_consensus.cc:462] T 
00000000000000000000000000000000 P 73bc7d22b7044c94a045e79cb2f31c57 [term 1 
LEADER]: Not starting forced leader election -- already a leader
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/integration-tests/auth_token_expire-itest.cc:588:
 Failure
Expected: (former_leader_master_idx) != (idx), actual: 0 vs 0
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/util/test_util.cc:348: 
Failure
Failed
Timed out waiting for assertion to pass.
{noformat}

Attaching the full test log.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to