Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10076 )
Change subject: KUDU-2287 Expose election failures as metrics ...................................................................... Patch Set 21: (7 comments) http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/consensus/raft_consensus.cc@164 PS21, Line 164: 0 means this tablet server is the leader nit: It'd be possible to see 0 on a follower as well. Say instead, "This metric is identically zero on a leader replica." http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/consensus/raft_consensus.cc@911 PS21, Line 911: last_leader_communication_time_micros_ = GetMonoTimeMicros(); I'd like to place this where the replica officially recognizes it has heard from the leader, which would be when it snoozes its failure detector in UpdateReplica around L1293. If we wait until after UpdateReplica we may have spent a while waiting on ops to be written to the log. http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2689 PS21, Line 2689: // leader should always report 0 since last leader heartbeat nit: Here and below, comments should begin with a capital letter and end with a period or other terminal. http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2702 PS21, Line 2702: no more than 1500ms This is reasonable, but it might make the test eventually flaky, say in TSAN with stress threads, so I think we can drop it. http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2708 PS21, Line 2708: SleepFor(MonoDelta::FromMilliseconds(2000)); Lower -heartbeat_interval_ms and elections will trigger faster, so this test won't have to take 2+ seconds. http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2709 PS21, Line 2709: ASSERT_TRUE(GetFailedElectionsSinceStableLeader(follower, tablet_id_) > 0); Just in the interests of avoiding flakiness, let's ASSERT_EVENTUALLY on this, since it's sufficient to know that eventually an election is called, fails, and we see it with the metric. http://gerrit.cloudera.org:8080/#/c/10076/21/src/kudu/integration-tests/raft_consensus-itest.cc@2710 PS21, Line 2710: } It'd also be nice to restart the tservers and make sure (ASSERT_EVENTUALLY) that the metric resets to zero, indicating a new leader. If with all this the test takes 2-3 seconds, maybe we should skip it unless AllowSlowTests() returns true? I'm not sure what our threshold for slow tests is. -- To view, visit http://gerrit.cloudera.org:8080/10076 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1b25df258cdba7bdae7bb2d7b4eb3d73b53425c3 Gerrit-Change-Number: 10076 Gerrit-PatchSet: 21 Gerrit-Owner: Attila Bukor <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-Comment-Date: Wed, 30 May 2018 16:10:28 +0000 Gerrit-HasComments: Yes
