Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Kevin Wu Thu, 30 Apr 2026 13:03:57 -0700

Hi Tony,

Thanks for the KIP. I agree that having metrics for timeouts in KRaft would
be a nice addition. I have a few high-level comments about the KIP:

KW1: Did you consider making a tagged metric like `number-of-timeouts`
instead of individual metrics? You could tag by the timer name (e.g. fetch,
election, update-voter, check-quorum, begin-quorum-epoch), since
KRaft supports several kinds of timers, and may add more in the future. You
can look at `NodeMetrics.java` and
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+generic+feature+level+metrics
for an example of tagged metrics using Kafka's new metrics library. I think
there is an argument for adding timeout metrics for some of the other KRaft
timers I mentioned, since reporting them could also help operators diagnose
network partitions or possible software bugs.
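
A minimal sketch of the tagged-counter idea in KW1, using only the JDK. It
does not use Kafka's metrics classes; the class name and the `timer-name`
tag key are assumptions for illustration, and only `number-of-timeouts`
comes from the suggestion above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for one tagged metric. A real implementation would
// register one metric name with a tag per timer in Kafka's metrics library.
final class TaggedTimeoutCounter {
    static final String METRIC_NAME = "number-of-timeouts"; // name from KW1
    static final String TAG_KEY = "timer-name";             // assumed tag key

    // One counter per tag value (e.g. "fetch", "election", "check-quorum").
    private final Map<String, LongAdder> countsByTimer = new ConcurrentHashMap<>();

    // Called whenever the named timer expires.
    void recordTimeout(String timerName) {
        countsByTimer.computeIfAbsent(timerName, k -> new LongAdder()).increment();
    }

    // Current count for one tag value; 0 if that timer has never expired.
    long count(String timerName) {
        LongAdder adder = countsByTimer.get(timerName);
        return adder == null ? 0L : adder.sum();
    }

    // Tags that would accompany METRIC_NAME for this timer.
    Map<String, String> tagsFor(String timerName) {
        return Map.of(TAG_KEY, timerName);
    }
}
```

With this shape, supporting a new timer only means recording under a new tag
value; no new metric name is needed.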

KW2: I see the "Type" for each metric is `CumulativeCount`. I think this
might be overkill, and that we could just use Integer for the data type,
and expose an increment method for each metric. In general, sensors are
used when multiple metrics are associated with a specific concept (e.g.
`commit-latency-avg` and `commit-latency-max` are two different metrics
associated with the same concept of "commit latency"). It is hard for me to
imagine that the number of timeouts occurring would have more than one
metric associated with it.
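
As a rough illustration of the simpler alternative in KW2, the sketch below
keeps a plain integer count with an increment method and exposes it through
a value supplier, the way a gauge-style metric could read it. The class and
method names are hypothetical and no Kafka API is used.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder for a single timeout count, with no sensor involved.
final class TimeoutCount {
    private final AtomicInteger count = new AtomicInteger();

    // Called from the code path that handles the timer expiring.
    void increment() {
        count.incrementAndGet();
    }

    // A metrics library could poll this supplier to report the current value.
    Supplier<Integer> valueSupplier() {
        return count::get;
    }
}
```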

KW3: Each of these timers is associated with an EpochState (e.g. the fetch
timer with FollowerState, check quorum timer with LeaderState, etc.). What
should the value of these metrics be when a node transitions between
EpochStates? Should we stop reporting the metrics associated with the old
EpochState, and start reporting the metrics associated with the new
EpochState? I personally think it might be confusing if these metrics
report values even if the underlying timer does not exist on the node. For
example, the fetch timeout metric reporting a value when the local node is
the KRaft leader seems odd to me. When we added metrics for KIP-853
associated with the leader (e.g. `uncommitted-voter-change`), we decided to
only report values for those metrics when the local node was the leader. It
would be nice if we could follow that convention for these metrics too, and
document which states report which metrics in the KIP. What do you think?
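
One way to picture the convention asked about in KW3 is a small sketch
where the set of reported timeout metrics is swapped on each state
transition. Everything here is an assumption for illustration: the enum
covers only the two states named above, and the mapping uses only the
fetch/FollowerState and check-quorum/LeaderState pairs mentioned in this
email.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of "only report metrics whose timer exists in the
// current EpochState". It tracks metric names only, not their values.
final class StateScopedTimeoutMetrics {
    enum State { FOLLOWER, LEADER }

    private static final Map<State, List<String>> TIMERS_BY_STATE =
        new EnumMap<>(State.class);
    static {
        TIMERS_BY_STATE.put(State.FOLLOWER, List.of("fetch"));
        TIMERS_BY_STATE.put(State.LEADER, List.of("check-quorum"));
    }

    private final Set<String> reported = new TreeSet<>();

    // Stop reporting the old state's timers and start reporting the new ones.
    void transitionTo(State newState) {
        reported.clear();
        reported.addAll(TIMERS_BY_STATE.get(newState));
    }

    // Immutable snapshot of the timer names currently being reported.
    Set<String> reportedTimers() {
        return Set.copyOf(reported);
    }
}
```

Whether a count should reset or carry over when a node re-enters a state is
a separate question that this sketch does not answer.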

Best,
Kevin Wu

On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev <[email protected]>
wrote:

> Hello everyone,
>
> I'd like to start a discussion on KIP-1322: Add metrics to Kraft that
> measure the number of fetch timeouts and election timeouts <
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322%3A+Add+metrics+to+Kraft+that+measure+the+number+of+fetch+timeouts+and+election+timeouts
> >
>
> This proposal aims to add new metrics to KRaft that track how often fetch
> timeouts and election timeouts occur.
>
> Best regards,
> Tony Tang
>
