Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Tony Tang via dev Fri, 08 May 2026 10:06:31 -0700

Hi Justine,

Thanks for the feedback!
I looked through the Kafka codebase and found a couple of existing timeout
metrics:


1. worker-poll-timeout-count in Kafka Connect, which is a cumulative count
of poll timeouts, implemented as a Gauge using an AtomicLong counter. This
is the closest to what we're proposing.
2. AcquisitionLockTimeoutPerSec in Share Groups, which is a Meter that
tracks the rate of acquisition lock timeouts.
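
For illustration, here is a minimal sketch of the gauge-over-AtomicLong
pattern from item 1. The class and method names are hypothetical and this is
not Kafka Connect's actual code; LongSupplier stands in for Kafka's gauge
type, and ratePerSec is only an assumed helper showing how a rate like the
one in item 2 could be derived from two samples of a cumulative count.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a cumulative timeout counter exposed as a gauge.
final class PollTimeoutCounter {
    private final AtomicLong count = new AtomicLong();

    // Called each time a poll times out.
    void recordTimeout() {
        count.incrementAndGet();
    }

    // Gauge-style view: a reporter reads the running total on each sample.
    LongSupplier gauge() {
        return count::get;
    }

    // Rate between two samples of the cumulative count (assumed helper).
    static double ratePerSec(long earlierCount, long laterCount, double elapsedSeconds) {
        return (laterCount - earlierCount) / elapsedSeconds;
    }
}

final class PollTimeoutGaugeDemo {
    public static void main(String[] args) {
        PollTimeoutCounter counter = new PollTimeoutCounter();
        LongSupplier gauge = counter.gauge();
        long before = gauge.getAsLong();
        counter.recordTimeout();
        counter.recordTimeout();
        long after = gauge.getAsLong();
        System.out.println("cumulative count: " + after);
        System.out.println("rate over 10s: "
            + PollTimeoutCounter.ratePerSec(before, after, 10.0));
    }
}
```

The cumulative count never decreases, so a monitoring system can compute
rates over any window it likes from successive samples.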

I think our KIP follows a similar approach to worker-poll-timeout-count.
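
As a rough sketch of the behavior discussed under KW1 and KW3 further down
this thread (one metric tagged by timer name, with the internal counter kept
across state transitions and the metric reported only while the underlying
timer exists), something like the following could work. All class and method
names here are hypothetical, a plain Set stands in for registration with
Kafka's metrics registry, and none of this is taken from KafkaRaftMetrics.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-timer counters that outlive metric registration.
final class TimeoutMetricsSketch {
    // Internal counters keyed by timer name; kept for the life of the process.
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
    // Timer names whose metric is currently registered (stand-in for a registry).
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    private AtomicLong counterFor(String timerName) {
        return counters.computeIfAbsent(timerName, name -> new AtomicLong());
    }

    // Called when the node enters a state in which the timer exists.
    void register(String timerName) {
        counterFor(timerName);
        registered.add(timerName);
    }

    // Called when the node leaves that state; the counter itself is retained.
    void unregister(String timerName) {
        registered.remove(timerName);
    }

    void recordTimeout(String timerName) {
        counterFor(timerName).incrementAndGet();
    }

    // Value a reporter would see: empty while the metric is not registered.
    OptionalLong reported(String timerName) {
        return registered.contains(timerName)
            ? OptionalLong.of(counterFor(timerName).get())
            : OptionalLong.empty();
    }
}

final class TimeoutMetricsDemo {
    public static void main(String[] args) {
        TimeoutMetricsSketch metrics = new TimeoutMetricsSketch();
        metrics.register("fetch");                      // node is a follower
        for (int i = 0; i < 5; i++) {
            metrics.recordTimeout("fetch");
        }
        System.out.println(metrics.reported("fetch"));  // value 5 is reported
        metrics.unregister("fetch");                    // node becomes leader
        System.out.println(metrics.reported("fetch"));  // no value is reported
        metrics.register("fetch");                      // node is a follower again
        metrics.recordTimeout("fetch");
        System.out.println(metrics.reported("fetch"));  // value 6 is reported
    }
}
```

This mirrors Kevin's example: the reported value goes 5, then absent, then 5
again on re-registration, and 6 after the next fetch timeout.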

Thanks,
Tony

On Thu, May 7, 2026 at 5:21 PM Justine Olshan via dev <[email protected]>
wrote:

> Hey Tony,
>
> Thanks for the KIP. Overall, I think the idea makes sense and it seems like
> you and Kevin are getting closer to agreement on the exact definition of
> the metrics.
> I was curious, are there any other timeout metrics you can find in Kafka
> and how are those defined? We don't necessarily need to do the same, but
> was curious if there was any precedent for this type of metric.
>
> Thanks,
> Justine
>
> On Mon, May 4, 2026 at 9:25 AM Tony Tang via dev <[email protected]>
> wrote:
>
> > Hi Kevin,
> >
> > Thanks for the reply.
> >
> > KW3: Ok, I agree that metric values at time X should be an accurate
> > snapshot of the node at time X, and that collecting historical values is
> > not the responsibility of Kafka. The approach you described makes sense to
> > me: we keep the internal counter across state transitions, but only
> > register/unregister the metric based on whether the underlying timer
> > exists. I'm on board with that approach.
> >
> > KW4: That's a great suggestion. However, I think it's out of scope for this
> > KIP. I'd prefer to keep this KIP focused on timeout expiration metrics and
> > maybe we can discuss the state transition counting in the future.
> >
> > Best,
> > Tony
> >
> > On Fri, May 1, 2026 at 5:12 PM Kevin Wu <[email protected]> wrote:
> >
> > > Hi Tony,
> > >
> > > Thanks for the reply.
> > >
> > > KW3: I don't think "count" metrics like the ones being discussed in this
> > > KIP should report values when the objects they are associated with do not
> > > exist. This would mean metric values at time X are not an accurate
> > > "snapshot" of the Kafka node at time X. In my opinion, collecting the
> > > historic values for a metric, visualizing them through dashboards, and
> > > monitoring them to alert operators are the responsibilities of downstream
> > > observability software, not Kafka. Kafka does have the capability to create
> > > "derivative" metrics (i.e. metrics whose values are based on sampling
> > > something else over time) via Sensors, but I don't think that fits our use
> > > case as previously discussed.
> > >
> > > Another way to think about it is that adding or removing a metric so Kafka
> > > starts or stops reporting a value to an observability service actually
> > > tells the user more about Kafka than unconditionally reporting said metric
> > > value. Additionally, just because you remove a metric does not mean you
> > > need to remove the corresponding counter within the KafkaRaftMetrics
> > > object. For example, a node's fetch request times out 5 times while it is a
> > > follower, so the node reports 5 for the metric. Next, the node becomes the
> > > leader, so it stops reporting the metric (i.e. the metric has no value).
> > > Then it becomes a follower, and starts reporting 5 again. When the next
> > > fetch timeout happens, the node reports 6 for the metric. What do you think
> > > about this behavior?
> > >
> > > KW4: If the desire is for a metric that always reports a value, what do you
> > > think about a metric that counts the number of `EpochState` transitions? I
> > > think it makes sense to report this value for the lifetime of a process,
> > > and generally, frequent state transitions are an indication something is
> > > wrong with the cluster. This would be an additional metric to the ones we
> > > discussed above.
> > >
> > > Best,
> > > Kevin Wu
> > >
> > > On Fri, May 1, 2026 at 12:59 PM Tony Tang via dev <[email protected]>
> > > wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > Thanks for the reply. Very insightful points.
> > > >
> > > > KW1: Yes, using a single tagged metric makes sense. It's cleaner and more
> > > > extensible. I'll adopt this approach.
> > > >
> > > > KW2: Yes, we don't need to use `CumulativeCount`. Already updated in the
> > > > KIP.
> > > >
> > > > KW3: I understand each timer is only meaningful in certain states, but the
> > > > metric value is still useful for operational monitoring regardless of the
> > > > current state. It tells you how many times a timeout has expired over the
> > > > lifetime of the node. Hiding or clearing the metric when the node isn't in
> > > > the relevant state could actually make it harder for users to diagnose
> > > > historical issues, since they'd need to catch the metric while the node
> > > > happens to be in the right state. For example, if a follower had repeated
> > > > fetch timeout expirations and then transitioned to a candidate/leader, the
> > > > metrics would still be valuable for diagnosing why the leader election
> > > > happened in the first place, right? If we cleared the metric on state
> > > > transition, that information would be lost. The question is: do we want
> > > > the metric to reflect only the latest state, or the overall timeout
> > > > behavior over the node's lifetime? I lean toward the latter, as it provides
> > > > more useful information for monitoring network issues. To avoid confusion,
> > > > maybe we can use the metric name lifetime-timeout-count + tag
> > > > timer-name=fetch/election? What do you think?
> > > >
> > > > On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu <[email protected]> wrote:
> > > >
> > > > > Hi Tony,
> > > > >
> > > > > Thanks for the KIP. I agree that having metrics for timeouts in KRaft
> > > > > would be a nice addition. I have a few high-level comments about the KIP:
> > > > >
> > > > > KW1: Did you consider making a tagged metric like `number-of-timeouts`
> > > > > instead of individual metrics? You could tag by the timer name (e.g.
> > > > > fetch, election, update-voter, check-quorum, begin-quorum-epoch, etc.)
> > > > > since KRaft supports several kinds of timers, and may add more in the
> > > > > future. You can look at `NodeMetrics.java` and
> > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$
> > > > > for an example of tagged metrics using Kafka's new metrics library. I
> > > > > think there is an argument we should add timeout metrics for some of
> > > > > these other KRaft timers I mentioned, since reporting them could also
> > > > > help operators diagnose network partitions or possible software bugs.
> > > > >
> > > > > KW2: I see the "Type" for each metric is `CumulativeCount`. I think this
> > > > > might be overkill, and that we could just use Integer for the data type,
> > > > > and expose an increment method for each metric. In general, sensors are
> > > > > used when multiple metrics are associated with a specific concept (e.g.
> > > > > `commit-latency-avg` and `commit-latency-max` are two different metrics
> > > > > associated with the same concept of "commit latency"). It is hard for me
> > > > > to imagine that the number of timeouts occurring would have more than one
> > > > > metric associated with it.
> > > > >
> > > > > KW3: Each of these timers is associated with an EpochState (e.g. the
> > > > > fetch timer with FollowerState, check quorum timer with LeaderState,
> > > > > etc.). What should the value of these metrics be when a node transitions
> > > > > between EpochStates? Should we stop reporting the metrics associated with
> > > > > the old EpochState, and start reporting the metrics associated with the
> > > > > new EpochState? I personally think it might be confusing if these metrics
> > > > > report values even if the underlying timer does not exist on the node.
> > > > > For example, the fetch timeout metric reporting a value when the local
> > > > > node is the KRaft leader seems odd to me. When we added metrics for
> > > > > KIP-853 associated with the leader (e.g. `uncommitted-voter-change`), we
> > > > > decided to only report values for those metrics when the local node was
> > > > > the leader. It would be nice if we could follow that convention for these
> > > > > metrics too, and document which states report which metrics in the KIP.
> > > > > What do you think?
> > > > >
> > > > > Best,
> > > > > Kevin Wu
> > > > >
> > > > > On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > I'd like to start a discussion on KIP-1322: Add metrics to Kraft that
> > > > > > measure the number of fetch timeouts and election timeouts
> > > > > > <https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$>
> > > > > >
> > > > > > This proposal aims to add new metrics to KRaft that track how often fetch
> > > > > > timeouts and election timeouts occur.
> > > > > >
> > > > > > Best regards,
> > > > > > Tony Tang
> > > > > >
> > > > >
> > > >
> > >
> >
>
