[
https://issues.apache.org/jira/browse/KUDU-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304449#comment-17304449
]
Bankim Bhavsar commented on KUDU-3266:
--------------------------------------
Examined another failure in
ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1, where
GetParam() = 3.
6fb is the original leader; it gets paused, and cba becomes the new leader
(term 3).
cd7 is the follower.
{noformat}
I0317 09:18:20.449790 21057 raft_consensus.cc:479] T
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 2
FOLLOWER]: Starting leader election (detected failure of leader
6fb6e93836bb45ae882edd7e0d26c852)
I0317 09:18:20.449846 21057 raft_consensus.cc:3032] T
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 2
FOLLOWER]: Advancing to term 3
I0317 09:18:20.459255 21057 raft_consensus.cc:683] T
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 3
LEADER]: Becoming Leader. State: Replica: cba6c57c99f44acfabfb650e6cb94d06,
State: Running, Role: LEADER
I0317 09:18:20.841820 21058 sys_catalog.cc:434] T
00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824
[sys.catalog]: SysCatalogTable state changed. Reason: New leader
cba6c57c99f44acfabfb650e6cb94d06. Latest consensus state: current_term: 3
leader_uuid: "cba6c57c99f44acfabfb650e6cb94d06" committed_config { opid_index:
2860 OBSOLETE_local: false peers { permanent_uuid:
"cba6c57c99f44acfabfb650e6cb94d06" member_type: VOTER last_known_addr { host:
"127.0.92.125" port: 42749 } } peers { permanent_uuid:
"cd7cbe8654e7426ca818c1c667cef824" member_type: VOTER last_known_addr { host:
"127.0.92.124" port: 35709 } } peers { permanent_uuid:
"6fb6e93836bb45ae882edd7e0d26c852" member_type: VOTER last_known_addr { host:
"127.0.92.126" port: 41791 } attrs { promote: false } } }
I0317 09:18:20.842010 21058 sys_catalog.cc:437] T
00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824
[sys.catalog]: This master's current role is: FOLLOWER
{noformat}
The table creation request below could have been replicated to leader cba and
follower cd7, but not to 6fb.
{noformat}
I0317 09:18:22.128876 17932 catalog_manager.cc:1617] Servicing CreateTable
request from {username='slave'} at 127.0.0.1:47764:
name: "table-0"
{noformat}
It looks like the previous leader 6fb is up again and the new leader cba is paused:
{noformat}
I0317 09:18:22.342823 18013 raft_consensus.cc:1223] T
00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 [term 3
FOLLOWER]: Rejecting Update request from peer 6fb6e93836bb45ae882edd7e0d26c852
for earlier term 2. Current term is 3. Ops: []
{noformat}
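For reference, the rejection above follows the usual Raft term check. A minimal sketch of that rule, assuming textbook Raft semantics; the `Replica` struct and `HandleUpdate` name are hypothetical and are not Kudu's actual raft_consensus.cc code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-ins for illustration only.
enum class UpdateResult { kAccepted, kInvalidTerm };

struct Replica {
  int64_t current_term;
  std::string leader_uuid;

  // Textbook Raft rule: an Update stamped with a term older than ours
  // comes from a stale leader and is rejected without changing state.
  UpdateResult HandleUpdate(int64_t caller_term, const std::string& caller_uuid) {
    assert(caller_term >= 0);  // terms are never negative
    if (caller_term < current_term) {
      return UpdateResult::kInvalidTerm;
    }
    current_term = caller_term;  // adopt an equal or newer term
    leader_uuid = caller_uuid;   // and remember who leads it
    return UpdateResult::kAccepted;
  }
};
```

With cd7 at term 3, an Update from 6fb stamped with term 2 would take the `kInvalidTerm` branch, which is what the log line reports.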
The open table request fails, likely because it went to 6fb, which, between
itself and follower cd7, is the leader from the previous term (term 2).
{noformat}
Bad status: Not found: Unable to open table: the table does not exist:
table_name: "table-0"
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603:
Failure
Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't
generate new fatal failures in the current thread.
Actual: it does
{noformat}
Moments later, 6fb steps down as cd7 is paused and cba (term 3) becomes the
leader.
{noformat}
W0317 09:18:22.404738 17916 leader_election.cc:334] T
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06
[CANDIDATE]: Term 3 pre-election: RPC error from VoteRequest() call to peer
6fb6e93836bb45ae882edd7e0d26c852 (127.0.92.126:41791): Timed out: connection
negotiation to 127.0.92.126:41791 for RPC RequestConsensusVote timed out after
1.923s (ON_OUTBOUND_QUEUE)
I0317 09:18:22.407025 17942 raft_consensus.cc:1223] T
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 3
LEADER]: Rejecting Update request from peer 6fb6e93836bb45ae882edd7e0d26c852
for earlier term 2. Current term is 3. Ops: []
I0317 09:18:22.411103 20992 consensus_queue.cc:1038] T
00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [LEADER]:
Peer responded invalid term: Peer: permanent_uuid:
"cd7cbe8654e7426ca818c1c667cef824" member_type: VOTER last_known_addr { host:
"127.0.92.124" port: 35709 }, Status: INVALID_TERM, Last received: 2.3568, Next
index: 3569, Last known committed idx: 3572, Time since last communication:
0.000s
I0317 09:18:22.411545 21016 raft_consensus.cc:3027] T
00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [term 2
LEADER]: Stepping down as leader of term 2
I0317 09:18:22.411592 21016 raft_consensus.cc:726] T
00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [term 2
LEADER]: Becoming Follower/Learner. State: Replica:
6fb6e93836bb45ae882edd7e0d26c852, State: Running, Role: LEADER
{noformat}
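The step-down in the last log lines matches the leader-side half of the Raft term rule. A rough sketch, again assuming textbook Raft behavior; `LeaderState` and `OnPeerTerm` are hypothetical names, and whether Kudu advances the term at this exact point is an assumption not shown in the logs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for illustration only.
enum class Role { kLeader, kFollower };

struct LeaderState {
  int64_t term;
  Role role;

  // Called with the term a peer reported alongside INVALID_TERM.
  // Returns true if this node stepped down as a result.
  bool OnPeerTerm(int64_t peer_term) {
    assert(peer_term >= 0);  // terms are never negative
    if (role == Role::kLeader && peer_term > term) {
      role = Role::kFollower;  // a higher term exists, so stop leading
      term = peer_term;        // textbook Raft also adopts the higher term
      return true;
    }
    return false;
  }
};
```

Until the stale leader hears a higher term from some peer, it still believes it is the leader, which would explain the window in which the open table request could be answered by 6fb.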
> Flakiness in dynamic_multi_master_test in VerifyClusterAfterMasterAddition()
> function
> -------------------------------------------------------------------------------------
>
> Key: KUDU-3266
> URL: https://issues.apache.org/jira/browse/KUDU-3266
> Project: Kudu
> Issue Type: Test
> Components: master, test
> Affects Versions: 1.15.0
> Reporter: Bankim Bhavsar
> Assignee: Bankim Bhavsar
> Priority: Major
>
> {noformat}
> ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/cluster_verifier.cc:119:
> Failure
> Failed
> Bad status: Not found: Unable to open table: the table does not exist:
> table_name: "table-1"
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603:
> Failure
> Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't
> generate new fatal failures in the current thread.
> Actual: it does.
> 2021-03-17T17:04:19Z chronyd exiting
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:1099:
> Failure
> Expected: VerifyClusterAfterMasterAddition(master_hps, orig_num_masters_)
> doesn't generate new fatal failures in the current thread.
> Actual: it does.
> {noformat}
> Although the same verification function is used by other tests for add
> master, this flakiness started showing up after introduction of the
> RecoverDeadMaster test.
> https://github.com/apache/kudu/commit/4b4a8c0f2fdfd15524510821b27fc9c3b5d26b6b