Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11981 )

Change subject: [raft_consensus-itest] fix flake in Test_KUDU_1735
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/cluster_itest_util.h@138
PS2, Line 138: consensus::OpIdType op_id_type = consensus::RECEIVED_OPID
nit: doc for this?


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2468
PS2, Line 2468:   ASSERT_EQ(3, tablet_servers_.size());
Not a question about this patch specifically, but what's the point of testing 
3-4-3 with only 3 servers?


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2477
PS2, Line 2477: // Wait for at least one operation  committed in current term 
by the leader.
              :   // Otherwise, the RemoveServer() below would end up with 
IllegalState error:
              :   // 'Leader has not yet committed an operation in its own 
term'.
nit: this is helpful insofar as it describes what was failing before, but I 
still had to look into raft_consensus.cc to understand why this is the way it 
is. Maybe reword like:

"Wait for at least one operation to be committed in the current term before 
performing the config change. This is important since Raft enforces that an 
operation is committed in a given term before servicing a config change. 
Failure to do so may make this test yield errors upon removing the server."


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2500
PS2, Line 2500: affected_servers
nit: "affected" is pretty vague, so it's not as obvious as it could be in 
determining why this value gets updated when it does. Maybe call this 
"crashed_servers"?


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2512
PS2, Line 2512:
nit: not the fault of this patch, but could you take this extra space out?


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2517
PS2, Line 2517: In case of the 3-4-3 scheme, the catalog manager adds the 
evicted replica
              :     // back as a non-voter, and there might be two possible 
outcomes.
              :     // Particularly, after the tablet's Raft configuration has 
been updated,
              :     // the catalog manager sends DeleteTablet() request to the 
corresponding
              :     // tablet server upon receiving updated tablet report from 
leader replica.
I think the ordering of these sentences is a bit awkward. I was expecting the 
descriptions of the two outcomes to come immediately following their 
declaration (i.e. "there might be two possible outcomes"). It'd also be nice to 
explicitly mention the components of the race here. So maybe:

In the 3-4-3 case, the tablet's Raft config is updated to add the evicted 
replica back as a non-voter; around the same time, the catalog manager will 
send a DeleteTablet() request in response to the evicted replica. The 
concurrency of these operations may result in two outcomes:


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2522
PS2, Line 2522: 1. If the DeleteTablet() request succeeds before the leader 
replica
              :     //      receives ADD_PEER configuration follow-up update 
initiated by the
              :     //      catalog manager and re-discovers the evicted 
replica as a new peer
              :     //      with the LMP_MISMATCH status, the configuration 
update to add
              :     //      a new non-voter replica will stay pending since 
only one voter
              :     //      replica (the leader replica itself) is alive.
Is this LMP_MISMATCH information relevant to the outcome itself? Maybe this 
could be more succinct as:

1. If the DeleteTablet() request succeeds before the leader replica receives 
the ADD_PEER configuration, the update to add the new non-voter will remain 
pending and not be committed, since only one voter (the leader) is alive. Since 
no commit message is written to the WAL in this case, the tablet server will 
not hit the injected crash.

(if you go with this, you can also remove L2534)


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2536
PS2, Line 2536: MonoDelta::FromSeconds(5)
Why not kTimeout?


http://gerrit.cloudera.org:8080/#/c/11981/2/src/kudu/integration-tests/raft_consensus-itest.cc@2549
PS2, Line 2549: decltype(leader_tserver) srvs[] =
Is it possible to declare this as vector<decltype(leader_tserver)>? If so, 
could you add evicted_tserver only as appropriate, and use range-based 
iteration instead of the slightly awkward index-based iteration we have here? 
Namely, the fact that there's the `3 - affected_servers` strongly ties it in 
with the order of `srvs`; I think keeping it range-based rids future readers of 
having to think about that.



--
To view, visit http://gerrit.cloudera.org:8080/11981
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If44ad0e8363a3aead409484cff68843f1e5d6b6d
Gerrit-Change-Number: 11981
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Comment-Date: Sun, 25 Nov 2018 10:47:40 +0000
Gerrit-HasComments: Yes

Reply via email to