[
https://issues.apache.org/jira/browse/RATIS-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916110#comment-16916110
]
Josh Elser commented on RATIS-651:
----------------------------------
Was talking to Aravindan in chat – trying to summarize for the record:
What he has added in the first patch is a "heartbeat miss count" from each
participant in a Raft Group. This is nice to know when looking at a specific
Peer: we could see that this node knows that it's not heartbeating (maybe a
network partition!). However, a miss count from the perspective of a follower
doesn't give us insight into the health of the Group as a whole. Specifically,
it doesn't tell us whether the Leader still has a majority of nodes
participating (and thus whether new appends can happen).
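To make the "does the Leader still have a majority" question concrete, here is a rough sketch of the check I have in mind. The class name GroupHealth, the method hasMajority and the timeout parameter are made up for illustration; none of this is existing Ratis code.

```java
import java.util.Map;

/** Hypothetical helper: decides whether a leader still has a quorum of responsive peers. */
final class GroupHealth {
  private GroupHealth() {}

  /**
   * @param elapsedMsByFollower time since the last heartbeat response, keyed by follower id
   * @param timeoutMs threshold after which a follower counts as unresponsive (assumed value)
   * @return true if the leader plus the responsive followers form a strict majority
   */
  static boolean hasMajority(Map<String, Long> elapsedMsByFollower, long timeoutMs) {
    // The group is every follower plus the leader itself.
    int groupSize = elapsedMsByFollower.size() + 1;
    // The leader always counts toward its own majority.
    long responsive = elapsedMsByFollower.values().stream()
        .filter(elapsed -> elapsed <= timeoutMs)
        .count() + 1;
    return responsive > groupSize / 2;
  }
}
```

With three nodes, the Leader plus one responsive follower is enough; with five nodes, the Leader needs two responsive followers.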
There are a couple of ways that we could present this information. One way that
Aravindan suggested would be to report a metric which is the amount of time
since we last got a heartbeat from a peer. This would end up looking like a
sawtooth graph: the value climbs until the next heartbeat arrives, then drops
back to zero. If we do this for each Peer in a RaftGroup, we could easily
validate that all Peers are heartbeating with the Leader (i.e. all metrics have
a small value).
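A minimal sketch of such a "time since last heartbeat" gauge, just to show where the sawtooth comes from. HeartbeatAgeGauge is a hypothetical name, and the injected clock is my own assumption to keep the example testable; it is not how Ratis tracks timestamps.

```java
import java.util.function.LongSupplier;

/** Hypothetical gauge: milliseconds since the last heartbeat response from one peer. */
final class HeartbeatAgeGauge {
  private final LongSupplier clockMs;   // assumed millisecond clock, e.g. System::currentTimeMillis
  private volatile long lastHeartbeatMs;

  HeartbeatAgeGauge(LongSupplier clockMs) {
    this.clockMs = clockMs;
    this.lastHeartbeatMs = clockMs.getAsLong();
  }

  /** Call whenever a heartbeat response arrives; resets the age to zero. */
  void onHeartbeat() {
    lastHeartbeatMs = clockMs.getAsLong();
  }

  /** The value a metrics reporter would sample; it grows until the next heartbeat. */
  long getValue() {
    return clockMs.getAsLong() - lastHeartbeatMs;
  }
}
```

Sampled periodically, getValue() rises linearly and falls to zero on every onHeartbeat() call. A peer that stops responding shows up as a line that keeps climbing.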
{quote}define an new metrics for HeartBeat itself and aggregate it in the
LeaderState.
{quote}
I think this sounds like the right approach. The LeaderState has its SenderList
(which is just a List of LogAppenders). If, from LeaderState, we could export
metrics for "last heard response" for each Peer, I think this is exactly what I
was hoping to find!
For example, something like:
{code:java}
// For each follower, report its name and the time since its last RPC response.
senderList.forEach(appender ->
    server.getLeaderElectionMetricsRegistry().reportLastHeartbeat(
        appender.toString(),
        appender.getFollower().getLastRpcResponseTime().elapsedTimeMs()));
{code}
So, we would report each peer's name and the time elapsed since we last heard
from that peer.
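To illustrate what the receiving side of that call might look like, here is a hedged, in-memory stand-in. HeartbeatMetricsRegistry, snapshot() and maxElapsedMs() are hypothetical names for this sketch only, and reportLastHeartbeat is the proposed method from the snippet above, not an existing Ratis API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the registry used in the snippet above. */
final class HeartbeatMetricsRegistry {
  private final Map<String, Long> elapsedMsByPeer = new ConcurrentHashMap<>();

  /** Records the elapsed time since the last response from the named peer. */
  void reportLastHeartbeat(String peerName, long elapsedMs) {
    elapsedMsByPeer.put(peerName, elapsedMs);
  }

  /** Read-only view, one entry per peer, for a metrics reporter to export. */
  Map<String, Long> snapshot() {
    return Collections.unmodifiableMap(elapsedMsByPeer);
  }

  /** Largest elapsed time across all peers; zero if nothing has been reported yet. */
  long maxElapsedMs() {
    return elapsedMsByPeer.values().stream()
        .mapToLong(Long::longValue)
        .max()
        .orElse(0L);
  }
}
```

Keeping one entry per peer preserves the per-Peer view, while maxElapsedMs() gives a single number to alert on for the whole Group.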
> Add metrics related to leaderElection and HeartBeat
> ---------------------------------------------------
>
> Key: RATIS-651
> URL: https://issues.apache.org/jira/browse/RATIS-651
> Project: Ratis
> Issue Type: Sub-task
> Components: server
> Affects Versions: 0.4.0
> Reporter: Shashikant Banerjee
> Assignee: Aravindan Vijayan
> Priority: Critical
> Attachments: RATIS-651-000.patch
>
>
> Following metrics would be helpful to determine the leader election events
> and timeouts:
>
> |numLeaderElections|Number of leader elections since the creation of ratis pipeline|
> |numLeaderElectionTimeouts|Number of leader election timeouts or failures|
> |LeaderElectionCompletionLatency|Time required to complete a leader election|
> |MaxNoLeaderInterval|Max time where there has been no elected leader in the raft ring|
> |heartBeatMissCount|Number of times heartBeat response is missed from a server|