[ 
https://issues.apache.org/jira/browse/RATIS-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916110#comment-16916110
 ] 

Josh Elser commented on RATIS-651:
----------------------------------

Was talking to Aravindan in chat – trying to summarize for the record:

What he has added in the first patch is a "heartbeat miss count" from each 
participant in a Raft Group. This is nice to know when looking at a specific 
Peer: we could see that this node knows that it's not heartbeating (maybe a 
network partition!). However, a miss count from the perspective of a follower 
doesn't give us insight into the health of a Group. Specifically, does the 
Leader still have a majority of nodes participating (and thus, can new appends 
happen)?
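
To make the group-health question concrete, here is a rough sketch of the check I have in mind. It is not Ratis code: the class GroupHealthSketch, the method leaderHasMajority, and the timeoutMs threshold are all made up for illustration, and it assumes the Leader already knows how many milliseconds have passed since each follower last responded.

```java
import java.util.Map;

/** Hypothetical helper, not part of Ratis: does the leader still hear from a majority? */
final class GroupHealthSketch {
  private GroupHealthSketch() {}

  /**
   * @param elapsedMsByFollower ms since the leader last got a response from each follower
   * @param timeoutMs followers silent for longer than this are treated as not participating
   * @return true if the leader plus the responsive followers form a strict majority
   */
  static boolean leaderHasMajority(Map<String, Long> elapsedMsByFollower, long timeoutMs) {
    // The group is every follower plus the leader itself.
    final int groupSize = elapsedMsByFollower.size() + 1;
    // The leader counts toward its own majority.
    int responsive = 1;
    for (long elapsedMs : elapsedMsByFollower.values()) {
      if (elapsedMs <= timeoutMs) {
        responsive++;
      }
    }
    return responsive > groupSize / 2;
  }
}
```

In a three-node group, one responsive follower is enough for a majority; if both followers go quiet, the check fails and new appends could not commit.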

There are a couple of ways that we could present this information: one way that 
Aravindan suggested would be to report a metric which is the amount of time 
since we last got a heartbeat from a peer. This would end up looking like a 
sawtooth graph. When we do this for each Peer in a RaftGroup, we could easily 
validate that all Peers are heartbeating with the Leader (e.g. all metrics have 
a small value).
{quote}define an new metrics for HeartBeat itself and aggregate it in the 
LeaderState.
{quote}
I think this sounds like the right approach. The LeaderState has its SenderList 
(which is just a List of LogAppenders). If, from LeaderState, we could export 
metrics for "last heard response" for each Peer, I think this is exactly what I 
was hoping to find!

e.g. something like
{code:java}
senderList.forEach(appender ->
    server.getLeaderElectionMetricsRegistry().reportLastHeartbeat(
        appender.toString(),
        appender.getFollower().getLastRpcResponseTime().elapsedTimeMs()));
{code}
So, we would report the peer's name and the time elapsed since we last heard 
from that peer.
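
For the bookkeeping behind such a metric, a minimal standalone sketch might look like the following. Everything in it is hypothetical (the LastHeardSketch class, its method names, and the injected clock are not Ratis APIs); it only illustrates recording a timestamp per peer on each response and reading back the elapsed time, which is what produces the sawtooth shape.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical tracker, not a Ratis class: remembers when each peer last responded. */
final class LastHeardSketch {
  private final Map<String, Long> lastResponseMs = new ConcurrentHashMap<>();
  // The clock is injected so the behaviour can be exercised deterministically.
  private final LongSupplier clockMs;

  LastHeardSketch(LongSupplier clockMs) {
    this.clockMs = clockMs;
  }

  /** Call whenever a response arrives from the peer; its elapsed time drops back to zero. */
  void onResponse(String peer) {
    lastResponseMs.put(peer, clockMs.getAsLong());
  }

  /** Elapsed ms since each known peer last responded, sorted by peer name. */
  Map<String, Long> elapsedMsByPeer() {
    final long now = clockMs.getAsLong();
    final Map<String, Long> elapsed = new TreeMap<>();
    lastResponseMs.forEach((peer, at) -> elapsed.put(peer, now - at));
    return elapsed;
  }
}
```

Each value climbs until the next response arrives and then resets, so a healthy peer stays near zero while a partitioned one keeps growing. Outside of a test, a monotonic source such as System.nanoTime() converted to milliseconds would be a safer clock than wall time.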

> Add metrics related to leaderElection and HeartBeat
> ---------------------------------------------------
>
>                 Key: RATIS-651
>                 URL: https://issues.apache.org/jira/browse/RATIS-651
>             Project: Ratis
>          Issue Type: Sub-task
>          Components: server
>    Affects Versions: 0.4.0
>            Reporter: Shashikant Banerjee
>            Assignee: Aravindan Vijayan
>            Priority: Critical
>         Attachments: RATIS-651-000.patch
>
>
> Following metrics would be helpful to determine the leader election events 
> and timeouts:
>  
> |numLeaderElections|Number of leader elections since the creation of ratis 
> pipeline|
> |numLeaderElectionTimeouts|Number of leader election timeouts or failures|
> |LeaderElectionCompletionLatency|Time required to complete a leader election|
> |MaxNoLeaderInterval|Max time where there has been no elected leader in the 
> raft ring|
> |heartBeatMissCount|No of times heartBeat response is missed from a server |



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
