Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-1548. Fix flaky 
RaftConsensusITest.TestReplaceChangeConfigOperation
......................................................................


KUDU-1548. Fix flaky RaftConsensusITest.TestReplaceChangeConfigOperation

This test was occasionally failing its 2nd election in TSAN mode because
we were resuming the previous leader before the new leader could be
elected. Sometimes the previous leader was fast enough to replicate its
pending config change to a majority of nodes before the new candidate
could send out its election RPC, thus violating the underlying
assumptions of this test.

I also added a minor C++11 syntax-only change in the cluster itest utils
class as a part of this patch (doesn't change any behavior).

Before this fix, this test failed 15/800 times on the dist-test cluster.
After this change, it passed 100% of the time.

The test log looked something like this:

I0729 18:59:47.834403 11544 raft_consensus.cc:370] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 [term 1 
FOLLOWER]: No leader contacted us within the election timeout. Triggering 
leader election
I0729 18:59:47.834686 11544 raft_consensus.cc:2019] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 [term 1 
FOLLOWER]: Advancing to term 2
I0729 18:59:47.840427 11544 leader_election.cc:223] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 
[CANDIDATE]: Term 2 election: Requesting vote from peer 
54197053abab4b6cb1b1632c9d1062dc
I0729 18:59:47.840860 11544 leader_election.cc:223] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 
[CANDIDATE]: Term 2 election: Requesting vote from peer 
3522a8de8170476dba0beb58cb2150d4
I0729 18:59:47.872720 11669 raft_consensus.cc:869] T 
e3503c47a21649ca931234999cd0bb45 P 3522a8de8170476dba0beb58cb2150d4 [term 1 
FOLLOWER]: Refusing update from remote peer 54197053abab4b6cb1b1632c9d1062dc: 
Log matching property violated. Preceding OpId in replica: term: 1 index: 1. 
Preceding OpId from leader: term: 1 index: 2. (index mismatch)
I0729 18:59:47.874522 11454 consensus_queue.cc:578] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [LEADER]: 
Connected to new peer: Peer: 3522a8de8170476dba0beb58cb2150d4, Is new: false, 
Last received: 1.1, Next index: 2, Last known committed idx: 1, Last exchange 
result: ERROR, Needs remote bootstrap: false
I0729 18:59:47.878105 11150 raft_consensus.cc:1324] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [term 1 
LEADER]: Handling vote request from an unknown peer 
d4f64819170a4cf78fe4c9e9a72ec4b9
I0729 18:59:47.878290 11150 raft_consensus.cc:2014] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [term 1 
LEADER]: Stepping down as leader of term 1
I0729 18:59:47.878451 11150 raft_consensus.cc:499] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [term 1 
LEADER]: Becoming Follower/Learner. State: Replica: 
54197053abab4b6cb1b1632c9d1062dc, State: 1, Role: LEADER
Watermarks: {Received: term: 1 index: 2 Committed: term: 1 index: 1}
I0729 18:59:47.878968 11150 consensus_queue.cc:162] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc 
[NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated op: 0.0, 
Majority replicated op: 1.1, Committed index: 1.1, Last appended: 1.2, Current 
term: 1, Majority size: -1, State: 1, Mode: NON_LEADER
I0729 18:59:47.879871 11150 consensus_peers.cc:358] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc -> Peer 
3522a8de8170476dba0beb58cb2150d4 (127.37.56.2:53243): Closing peer: 
3522a8de8170476dba0beb58cb2150d4
I0729 18:59:47.882057 11150 raft_consensus.cc:2019] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [term 1 
FOLLOWER]: Advancing to term 2
I0729 18:59:47.885711 11150 raft_consensus.cc:1626] T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [term 2 
FOLLOWER]: Leader election vote request: Denying vote to candidate 
d4f64819170a4cf78fe4c9e9a72ec4b9 for term 2 because replica has last-logged 
OpId of term: 1 index: 2, which is greater than that of the candidate, which 
has last-logged OpId of term: 1 index: 1.
I0729 18:59:47.892060 11477 leader_election.cc:361] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 
[CANDIDATE]: Term 2 election: Vote denied by peer 
54197053abab4b6cb1b1632c9d1062dc. Message: Invalid argument: T 
e3503c47a21649ca931234999cd0bb45 P 54197053abab4b6cb1b1632c9d1062dc [term 2 
FOLLOWER]: Leader election vote request: Denying vote to candidate 
d4f64819170a4cf78fe4c9e9a72ec4b9 for term 2 because replica has last-logged 
OpId of term: 1 index: 2, which is greater than that of the candidate, which 
has last-logged OpId of term: 1 index: 1.
I0729 18:59:47.894548 11669 raft_consensus.cc:1324] T 
e3503c47a21649ca931234999cd0bb45 P 3522a8de8170476dba0beb58cb2150d4 [term 1 
FOLLOWER]: Handling vote request from an unknown peer 
d4f64819170a4cf78fe4c9e9a72ec4b9
I0729 18:59:47.894688 11669 raft_consensus.cc:2019] T 
e3503c47a21649ca931234999cd0bb45 P 3522a8de8170476dba0beb58cb2150d4 [term 1 
FOLLOWER]: Advancing to term 2
I0729 18:59:47.896904 11669 raft_consensus.cc:1626] T 
e3503c47a21649ca931234999cd0bb45 P 3522a8de8170476dba0beb58cb2150d4 [term 2 
FOLLOWER]: Leader election vote request: Denying vote to candidate 
d4f64819170a4cf78fe4c9e9a72ec4b9 for term 2 because replica has last-logged 
OpId of term: 1 index: 2, which is greater than that of the candidate, which 
has last-logged OpId of term: 1 index: 1.
I0729 18:59:47.898056 11477 leader_election.cc:361] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 
[CANDIDATE]: Term 2 election: Vote denied by peer 
3522a8de8170476dba0beb58cb2150d4. Message: Invalid argument: T 
e3503c47a21649ca931234999cd0bb45 P 3522a8de8170476dba0beb58cb2150d4 [term 2 
FOLLOWER]: Leader election vote request: Denying vote to candidate 
d4f64819170a4cf78fe4c9e9a72ec4b9 for term 2 because replica has last-logged 
OpId of term: 1 index: 2, which is greater than that of the candidate, which 
has last-logged OpId of term: 1 index: 1.
I0729 18:59:47.898530 11477 leader_election.cc:248] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 
[CANDIDATE]: Term 2 election: Election decided. Result: candidate lost.
I0729 18:59:47.899269 11598 raft_consensus.cc:1821] T 
e3503c47a21649ca931234999cd0bb45 P d4f64819170a4cf78fe4c9e9a72ec4b9 [term 2 
FOLLOWER]: Leader election lost for term 2. Reason: None given
W0729 19:00:48.184541 11433 batcher.cc:315] Timed out: Failed to write batch of 
10 ops to tablet e3503c47a21649ca931234999cd0bb45 after 340 attempt(s): Failed 
to write to server: (no server available): Write(tablet: 
e3503c47a21649ca931234999cd0bb45, num_ops: 10, num_attempts: 340) passed its 
deadline: Illegal state: Replica 54197053abab4b6cb1b1632c9d1062dc is not leader 
of this config. Role: FOLLOWER. Consensus state: current_term: 2 leader_uuid: 
"" config { peers { permanent_uuid: "54197053abab4b6cb1b1632c9d1062dc" 
member_type: VOTER last_known_addr { host: "127.37.56.0" port: 33531 } } peers 
{ permanent_uuid: "3522a8de8170476dba0beb58cb2150d4" member_type: VOTER 
last_known_addr { host: "127.37.56.2" port: 53243 } } }
E0729 19:00:48.185212  9528 raft_consensus-itest.cc:246] Unexpected error: 
Timed out: Failed to write batch of 10 ops to tablet 
e3503c47a21649ca931234999cd0bb45 after 340 attempt(s): Failed to write to 
server: (no server available): Write(tablet: e3503c47a21649ca931234999cd0bb45, 
num_ops: 10, num_attempts: 340) passed its deadline: Illegal state: Replica 
54197053abab4b6cb1b1632c9d1062dc is not leader of this config. Role: FOLLOWER. 
Consensus state: current_term: 2 leader_uuid: "" config { peers { 
permanent_uuid: "54197053abab4b6cb1b1632c9d1062dc" member_type: VOTER 
last_known_addr { host: "127.37.56.0" port: 33531 } } peers { permanent_uuid: 
"3522a8de8170476dba0beb58cb2150d4" member_type: VOTER last_known_addr { host: 
"127.37.56.2" port: 53243 } } }

Change-Id: Ib91b5cc974656e82f670d6a938f537b63338d036
Reviewed-on: http://gerrit.cloudera.org:8080/3819
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <t...@apache.org>
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 3 insertions(+), 4 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3819
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ib91b5cc974656e82f670d6a938f537b63338d036
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to