Mike Percy has posted comments on this change. Change subject: [consensus] Add consensus op-level lag metrics ......................................................................
Patch Set 10: (8 comments) http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 343: // The highest index that is known to be replicated by all members of Ha, nice catch http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue-test.cc File src/kudu/consensus/consensus_queue-test.cc: Line 253: page_size_estimator.set_last_idx_appended_to_leader(0); I'm not sure of the purpose of the changes to this test. Does this mean this field is now required by the server? If so, this patch will break rolling upgrades (which are a nice-to-have when possible). Maybe we should provide a default value of 0 for this field so it doesn't have to be specified. http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS10, Line 88: is believes it is Line 156: queue_state_.last_idx_appended_to_leader = queue_state_.last_appended.index(); Maybe we should not track this in leader mode and instead do this re-initialization in NonLeaderMode(). It should be safe to assume (for our purposes with this metric) that the leader must be ahead of us in order to be the leader to we can move this line to line 183 or so to imply that our lag between the time we are demoted to follower and the time we get our next message from the leader would be 0. PS10, Line 301: // If LEADER, update the index of the last appended operation. : queue_state_.last_idx_appended_to_leader = last_id.index(); Seems strange to me to track this at all as the leader; see below for a suggestion. PS10, Line 380: last_idx_appended_to_leader last_appended.index() Line 603: queue_state_.last_idx_appended_to_leader = std::max({queue_state_.last_idx_appended_to_leader, Based on the changes we have made, I don't think we want the max() anymore because we can assume that the leader is always right, and the latest message received from the leader should contain the correct value. The reason max() is a problem is because if there is a leader change and the new leader truncates the log then our metric will be inaccurate until the leader catches up to the index of the previous leader. I also find it a little bizarre that in this file we call UpdateLagMetrics(queue_state_.last_idx_appended_to_leader) only to again compare that value with queue_state_.last_idx_appended_to_leader I think it would make more sense to have 2 methods here: void PeerMessageQueue::UpdateLagMetrics() { // Don't update queue_state_, only set the metric. } void PeerMessageQueue::UpdateLastIndexAppendedToLeader(int64_t index) { queue_state_.last_idx_appended_to_leader = index; UpdateLagMetrics(); } That way, from within the PeerMessageQueue class we would only call UpdateLagMetrics() and from RaftConsensus (and tests) we would call UpdateLastIndexAppendedToLeader(...). PS10, Line 605: metrics_.num_ops_behind_leader->set_value( : queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()); How about: int64_t num_ops_behind_leader = queue_state_.mode == LEADER ? 0 : queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()); metrics_.num_ops_behind_leader->set_value(num_ops_behind_leader); -- To view, visit http://gerrit.cloudera.org:8080/6451 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes