Hi Tony,

Thanks for the KIP. I agree that having metrics for timeouts in KRaft would be a nice addition. I have a few high-level comments about the KIP:

KW1: Did you consider making a tagged metric like `number-of-timeouts` instead of individual metrics? You could tag by the timer name (e.g. fetch, election, update-voter, check-quorum, begin-quorum-epoch), since KRaft supports several kinds of timers and may add more in the future. You can look at `NodeMetrics.java` and https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+generic+feature+level+metrics for an example of tagged metrics using Kafka's new metrics library. I think there is an argument for adding timeout metrics for some of the other KRaft timers I mentioned, since reporting them could also help operators diagnose network partitions or possible software bugs.

KW2: I see the "Type" for each metric is `CumulativeCount`. I think this might be overkill, and that we could just use Integer for the data type and expose an increment method for each metric. In general, sensors are used when multiple metrics are associated with a specific concept (e.g. `commit-latency-avg` and `commit-latency-max` are two different metrics associated with the same concept of "commit latency"). It is hard for me to imagine that the number of timeouts would have more than one metric associated with it.

KW3: Each of these timers is associated with an EpochState (e.g. the fetch timer with FollowerState, the check quorum timer with LeaderState, etc.). What should the value of these metrics be when a node transitions between EpochStates? Should we stop reporting the metrics associated with the old EpochState and start reporting the metrics associated with the new EpochState? I personally think it might be confusing if these metrics report values even when the underlying timer does not exist on the node. For example, the fetch timeout metric reporting a value when the local node is the KRaft leader seems odd to me. When we added metrics for KIP-853 associated with the leader (e.g. `uncommitted-voter-change`), we decided to only report values for those metrics when the local node was the leader. It would be nice if we could follow that convention for these metrics too, and document in the KIP which states report which metrics. What do you think?

Best,
Kevin Wu

On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev <[email protected]> wrote:

> Hello everyone,
>
> I'd like to start a discussion on KIP-1322: Add metrics to Kraft that
> measure the number of fetch timeouts and election timeouts <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322%3A+Add+metrics+to+Kraft+that+measure+the+number+of+fetch+timeouts+and+election+timeouts
> >
>
> This proposal aims to add new metrics to KRaft that track how often fetch
> timeouts and election timeouts occur.
>
> Best regards,
> Tony Tang
>
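
P.S. For KW1, here is a rough, self-contained sketch of the shape I have in mind: one logical timeout counter keyed by the timer name, rather than one separately named metric per timer. The class and method names (`TimeoutCounters`, `record`, `count`) are made up for illustration, and the sketch deliberately does not use Kafka's metrics classes. In an actual implementation the timer name would presumably be a metric tag.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration only; this is NOT Kafka's metrics API.
final class TimeoutCounters {
    // One counter per timer-name "tag" value, created lazily on first expiry.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Record one expiry of the timer identified by timerName (e.g. "fetch").
    void record(String timerName) {
        counts.computeIfAbsent(timerName, k -> new LongAdder()).increment();
    }

    // Current count for a timer name; 0 if that timer has never expired.
    long count(String timerName) {
        LongAdder adder = counts.get(timerName);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        TimeoutCounters timeouts = new TimeoutCounters();
        timeouts.record("fetch");
        timeouts.record("fetch");
        timeouts.record("election");
        // A new timer kind needs no new metric name, only a new tag value.
        timeouts.record("check-quorum");
        for (String timer : new String[] {"fetch", "election", "check-quorum"}) {
            System.out.println("number-of-timeouts{timer=" + timer + "} = "
                    + timeouts.count(timer));
        }
    }
}
```

The point of the sketch is only that adding a timer later means adding a tag value, not a new metric definition. It does not address KW3 (which states report which tags).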
