Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10076 )

Change subject: KUDU-2287 Expose election failures as metrics
......................................................................


Patch Set 21:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/consensus/raft_consensus.cc@164
PS21, Line 164: 0 means this tablet server is the leader
nit: It'd be possible to see 0 on a follower as well. Say instead, "This metric 
is identically zero on a leader replica."


http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/consensus/raft_consensus.cc@911
PS21, Line 911: last_leader_communication_time_micros_ = GetMonoTimeMicros();
I'd like to place this where the replica officially recognizes it has heard 
from the leader, which would be when it snoozes its failure detector in 
UpdateReplica around L1293. If we wait until after UpdateReplica we may have 
spent a while waiting on ops to be written to the log.


http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2689
PS21, Line 2689: // leader should always report 0 since last leader heartbeat
nit: Here and below, comments should begin with a capital letter and end with a 
period or other terminal.


http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2702
PS21, Line 2702: no more than 1500ms
This is reasonable, but it might make the test eventually flaky, say in TSAN 
with stress threads, so I think we can drop it.


http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2708
PS21, Line 2708: SleepFor(MonoDelta::FromMilliseconds(2000));
Lower -heartbeat_interval_ms and elections will trigger faster, so this test 
won't have to take 2+ seconds.


http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2709
PS21, Line 2709: ASSERT_TRUE(GetFailedElectionsSinceStableLeader(follower, 
tablet_id_) > 0);
Just in the interests of avoiding flakiness, let's ASSERT_EVENTUALLY on this, 
since it's sufficient to know that eventually an election is called, fails, and 
we see it with the metric.


http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2710
PS21, Line 2710: }
It'd also be nice to restart the tservers and make sure (ASSERT_EVENTUALLY) 
that the metric resets to zero, indicating a new leader.

If with all this the test takes 2-3 seconds, maybe we should skip it unless 
AllowSlowTests() returns true? I'm not sure what our threshold for slow tests 
is.



--
To view, visit http://gerrit.cloudera.org:8080/10076
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1b25df258cdba7bdae7bb2d7b4eb3d73b53425c3
Gerrit-Change-Number: 10076
Gerrit-PatchSet: 21
Gerrit-Owner: Attila Bukor <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Wed, 30 May 2018 16:10:28 +0000
Gerrit-HasComments: Yes

Reply via email to