[
https://issues.apache.org/jira/browse/KUDU-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304422#comment-17304422
]
Bankim Bhavsar commented on KUDU-3266:
--------------------------------------
Here is an analysis of what happens among the 3 masters when one of them is
paused, a CreateTable request is issued, and an OpenTable request follows in
the next iteration of this loop:
https://github.com/apache/kudu/blob/master/src/kudu/master/dynamic_multi_master-test.cc#L598-L610
{code}
  LOG(INFO) << "Pausing and resuming individual masters";
  string table_name = kTableName;
  for (int i = 0; i < expected_num_masters; i++) {
    ASSERT_OK(migrated_cluster.master(i)->Pause());
    cluster::ScopedResumeExternalDaemon resume_daemon(migrated_cluster.master(i));
    NO_FATALS(cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0));
    // See MasterFailoverTest.TestCreateTableSync to understand why we must
    // check for IsAlreadyPresent as well.
    table_name = Substitute("table-$0", i);
    Status s = CreateTable(&migrated_cluster, table_name);
    ASSERT_TRUE(s.ok() || s.IsAlreadyPresent());
  }
{code}
Consider 3 masters A, B, C.
- A is the leader.
- A gets paused.
- B becomes the leader.
- The create table request is propagated to B and C, which form a quorum.
- Now A is resumed.
- While A is coming back up, B is paused.
- C becomes a candidate and tries to become leader by asking A for its vote.
But A was itself the leader before it was paused and, for some reason, doesn't
vote.
- The open table request now goes to master A (the leader) and gets a table
not found error, because A didn't receive the create table request while it
was paused.
- Moments later B (which was the leader before it was paused) resumes and wins
the election, and A steps down. But by this time the test has failed.
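The sequence above can be replayed with a small toy model. This is only a sketch under stated assumptions and is not Kudu code: ToyMaster, OpenResult, the free functions and the believes_leader flag are all hypothetical names, elections are not modeled at all, and an open request is simply served by the first running master that still considers itself the leader.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of one master; none of these names exist in Kudu.
struct ToyMaster {
  explicit ToyMaster(std::string n) : name(std::move(n)) {}
  std::string name;
  bool paused = false;
  bool believes_leader = false;   // assumption: stale leadership survives a pause
  std::set<std::string> tables;   // catalog entries this master has received
};

enum class OpenResult { kOk, kNotFound, kNoLeader };

// Apply a create-table operation to every running master.
// Succeeds only if the running masters form a majority.
bool CreateTable(std::vector<ToyMaster>& masters, const std::string& table) {
  int running = 0;
  for (const auto& m : masters) {
    if (!m.paused) ++running;
  }
  if (running * 2 <= static_cast<int>(masters.size())) return false;
  for (auto& m : masters) {
    if (!m.paused) m.tables.insert(table);
  }
  return true;
}

// Serve an open-table request from the first running master that
// considers itself the leader.
OpenResult OpenTable(const std::vector<ToyMaster>& masters,
                     const std::string& table) {
  for (const auto& m : masters) {
    if (m.paused || !m.believes_leader) continue;
    return m.tables.count(table) > 0 ? OpenResult::kOk : OpenResult::kNotFound;
  }
  return OpenResult::kNoLeader;
}

// Replay the steps from the list above and return what the open request sees.
OpenResult ReplayScenario() {
  std::vector<ToyMaster> m = {ToyMaster("A"), ToyMaster("B"), ToyMaster("C")};
  m[0].believes_leader = true;   // A is the leader
  m[0].paused = true;            // A gets paused
  m[1].believes_leader = true;   // B becomes the leader
  bool created = CreateTable(m, "table-0");  // reaches B and C only
  assert(created);
  m[0].paused = false;           // A is resumed, still without "table-0"
  m[1].paused = true;            // B is paused
  // C's election attempt does not succeed, so nothing changes for C.
  return OpenTable(m, "table-0");  // served by A
}
```

In this model ReplayScenario() returns OpenResult::kNotFound, which corresponds to the "table does not exist" failure in the test log: the only running master that treats itself as leader never saw the create.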
> Flakiness in dynamic_multi_master_test in VerifyClusterAfterMasterAddition()
> function
> -------------------------------------------------------------------------------------
>
> Key: KUDU-3266
> URL: https://issues.apache.org/jira/browse/KUDU-3266
> Project: Kudu
> Issue Type: Test
> Components: master, test
> Affects Versions: 1.15.0
> Reporter: Bankim Bhavsar
> Assignee: Bankim Bhavsar
> Priority: Major
>
> {noformat}
> ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/cluster_verifier.cc:119:
> Failure
> Failed
> Bad status: Not found: Unable to open table: the table does not exist:
> table_name: "table-1"
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603:
> Failure
> Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't
> generate new fatal failures in the current thread.
> Actual: it does.
> 2021-03-17T17:04:19Z chronyd exiting
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:1099:
> Failure
> Expected: VerifyClusterAfterMasterAddition(master_hps, orig_num_masters_)
> doesn't generate new fatal failures in the current thread.
> Actual: it does.
> {noformat}
> Although the same verification function is used by other tests that add a
> master, this flakiness started showing up after the introduction of the
> RecoverDeadMaster test.
> https://github.com/apache/kudu/commit/4b4a8c0f2fdfd15524510821b27fc9c3b5d26b6b