Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Tony Tang via dev Tue, 26 May 2026 09:39:39 -0700

Hi Justine,
Thanks for the reply. I added a Rejected Alternatives section in the KIP.


Thanks,
Tony

On Fri, May 22, 2026 at 5:25 PM Justine Olshan via dev <[email protected]>
wrote:

> Hey Tony,
> Sorry for the delay on the response. Thanks for sharing the metrics we have
> in Kafka.
>
> I think given the approaches we chose, it could be useful to include a
> section in the KIP for rejected alternatives with a summary of things we
> considered, but chose not to implement.
> That can be useful to folks who may not follow this discussion thread, but
> may have questions about certain design choices. Things like why gauge vs.
> meter and maybe some of the discussion above about whether to reset the
> values given the state.
>
> Thanks,
> Justine
>
>
>
> On Thu, May 14, 2026 at 10:55 AM Tony Tang via dev <[email protected]>
> wrote:
>
> > Hi Kevin,
> >
> > I think we are aligned that the metric should
> > > increment in value when its corresponding timer goes from unexpired to
> > > expired during a poll of the raft client. It should not increment in
> > value
> > > if the timer goes from expired to expired during a poll of the raft
> > client.
> >
> >
> > Yeah I agree with that. I'll add this point to the KIP
> >
> > Best,
> > Tony
> >
> > On Thu, May 14, 2026 at 12:09 PM Kevin Wu <[email protected]>
> wrote:
> >
> > > Hi Tony,
> > >
> > > Yeah we should always report these metrics. When/how to increment them
> is
> > > an implementation detail, but I think we are aligned that the metric
> > should
> > > increment in value when its corresponding timer goes from unexpired to
> > > expired during a poll of the raft client. It should not increment in
> > value
> > > if the timer goes from expired to expired during a poll of the raft
> > client.
> > > What do you think?
> > >
> > > Best,
> > > Kevin Wu
> > >
> > > On Wed, May 13, 2026 at 7:24 PM Tony Tang via dev <
> [email protected]>
> > > wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > Re KW3: That makes sense to me. In particular, if we only register
> the
> > > > election timeout metric during Unattached/Prospective/Candidate,
> those
> > > are
> > > > all short states. After a timeout the node changes status, and the
> > metric
> > > > gets unregistered. Since nodes spend most of their time in Follower
> or
> > > > Leader, operators may never see this metric at all. So yeah, let's
> just
> > > > always report these metrics.
> > > >
> > > > Re KW8:
> > > >
> > > > Implementing this like case 1 could diverge from the meaning of the
> > > metric
> > > > > in some cases. For example, expiring the fetch timeout for a voter
> > > always
> > > > > causes a transition to Prospective, so this metric is not updated
> on
> > > the
> > > > > next poll(). It also stops getting reported.
> > > >
> > > > Since we decide not registering/unregistering metrics based on state,
> > the
> > > > above concerns disappear, right?
> > > >
> > > > observers do not transition to another state
> > > > > when its fetch timeout expires, so every subsequent poll would
> > increase
> > > > > this metric's value, which is technically wrong based on what we
> > > defined
> > > > in
> > > > > the KIP.
> > > >
> > > > Okay, I understand the observer concern. Let's assume this KIP will
> be
> > > > implemented after KAFKA-20514, so the observer fetch timeout behavior
> > > will
> > > > already be fixed by then.
> > > >
> > > > In summary:
> > > > 1. we always report the metrics.
> > > > 2. we only increment the counter in the poll methods when isExpired()
> > is
> > > > called, as we originally planned. Does that work for you?
> > > >
> > > > Thanks,
> > > > Tony
> > > >
> > > > On Wed, May 13, 2026 at 12:10 PM Kevin Wu <[email protected]>
> > > wrote:
> > > >
> > > > > Hi Tony,
> > > > >
> > > > > KW3: Sorry for going back and forth on this, but one issue I see
> with
> > > not
> > > > > reporting these metrics when the node's current EpochState does not
> > > > contain
> > > > > the timer is that downstream observability platforms may not be
> able
> > to
> > > > > capture these values very well. For example, the fetch timeout can
> > only
> > > > > expire once on a FollowerAsVoter before it transitions to
> > Prospective,
> > > > and
> > > > > that value would not be reported/visible to operators. Only upon
> > > > returning
> > > > > to FollowerState would the operator see the fetch timeout metric
> > again,
> > > > > which cannot increment without first not reporting any value. That
> is
> > > not
> > > > > intuitive metric behavior to me, as the operator would see the
> value
> > > "too
> > > > > late".
> > > > >
> > > > > In summary, I think it is fine to always report these metrics, and
> if
> > > it
> > > > > turns out that is wrong or confusing, that's an implementation
> detail
> > > we
> > > > > can fix later. I think I was trying to draw too many parallels with
> > the
> > > > > KIP-853 leader metrics I implemented previously. Leader and
> Follower
> > > are
> > > > > "long-lived" states so I think it makes sense to report metrics
> which
> > > > only
> > > > > change in these states, during these states. Timer expirations
> > > > specifically
> > > > > cause transitions to and between more "intermittent" states (e.g.
> > > > > Prospective, Candidate, etc.), and therefore occur at the "end" of
> an
> > > > > EpochState, so registering and unregistering them does not seem
> like
> > a
> > > > good
> > > > > idea.
> > > > >
> > > > > KW8: Yeah, basically. Implementing this like case 1 could diverge
> > from
> > > > the
> > > > > meaning of the metric in some cases. For example, expiring the
> fetch
> > > > > timeout for a voter always causes a transition to Prospective, so
> > this
> > > > > metric is not updated on the next poll(). It also stops getting
> > > reported.
> > > > > However, in the current code, observers do not transition to
> another
> > > > state
> > > > > when its fetch timeout expires, so every subsequent poll would
> > increase
> > > > > this metric's value, which is technically wrong based on what we
> > > defined
> > > > in
> > > > > the KIP.
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/KAFKA-20514__;!!Ayb5sqE7!qUcUv6MZLg9t1tNcXiCJCDe1vw-YajXBZeS9xD-Un3niEylSMbS3gHZyccO2mgQha4feqEvYJSE5sbQTnAhaooA$
> > > > > fixes the issue
> > > > > for the fetch timer specifically.
> > > > >
> > > > > This comment specifically is just something I wanted to make you
> > aware
> > > of
> > > > > during your implementation, since it was not super obvious to me
> that
> > > > this
> > > > > interaction with poll(), EpochState transitions, and our proposed
> > > metrics
> > > > > occurred until a deeper look at the code.
> > > > >
> > > > > Best,
> > > > > Kevin Wu
> > > > >
> > > > > On Wed, May 13, 2026 at 8:51 AM Tony Tang via dev <
> > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > > Re KW 5,6,7: I agree with all three. I'll update the KIP
> > > > > > Re KW 8:
> > > > > > Case 1: The the poll methods in KafkaRaftClient actively calls
> > > > > isExpired()
> > > > > > on the timer, and when it finds the timer has expired, it
> > increments
> > > > the
> > > > > > counter. The counting logic lives in the caller
> > > > > > Case 2: The timer itself is aware of its expiration and
> > automatically
> > > > > > increments the counter when it transitions from not expired to
> > > expired.
> > > > > The
> > > > > > counting logic lives inside the timer.
> > > > > > Is that the cases you're talking about?
> > > > > >
> > > > > > Best,
> > > > > > Tony
> > > > > >
> > > > > >
> > > > > > On Tue, May 12, 2026 at 5:42 PM Kevin Wu <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Tony,
> > > > > > >
> > > > > > > Thanks for the updates to the KIP. I have a few minor comments:
> > > > > > >
> > > > > > > KW5: nit, but can we say “this metric is registered only…”
> > instead
> > > of
> > > > > > > “Registered only…” in the description column?
> > > > > > >
> > > > > > > KW6: What do you think about renaming the metric to
> > > > > > “timeout-expirations”?
> > > > > > > This would be slightly less verbose than what we have now.
> > > > > > >
> > > > > > > KW7: Instead of having separate columns for the metric tag and
> > > name,
> > > > > can
> > > > > > we
> > > > > > > have one column with the entire MBean name? For example:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> “kafka.server:type=node-metrics,name=maximum-supported-level,feature-name=X”.
> > > > > > >
> > > > > > > KW8: Something to consider in determining when these metric
> > values
> > > > > should
> > > > > > > be incremented: should  the value be incremented when the raft
> > > client
> > > > > > reads
> > > > > > > the timer and finds it has expired? Or, should the value be
> > > > incremented
> > > > > > > only when the timer goes from not expired to expired? The first
> > is
> > > > > simple
> > > > > > > but means the metric value is slightly different than what
> we’ve
> > > > > > described
> > > > > > > in the KIP. I feel like implementing the second may be pretty
> > > > complex…
> > > > > > >
> > > > > > > Best,
> > > > > > > Kevin Wu
> > > > > > >
> > > > > > > On Fri, May 8, 2026 at 10:06 AM Tony Tang via dev <
> > > > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Justine,
> > > > > > > >
> > > > > > > > Thanks for the feedback!
> > > > > > > > I looked through the Kafka codebase and found a couple of
> > > existing
> > > > > > > timeout
> > > > > > > > metrics:
> > > > > > > >
> > > > > > > > 1.worker-poll-timeout-count in Kafka Connect, which is a
> > > cumulative
> > > > > > count
> > > > > > > > of poll timeouts, implemented as a Gauge using an AtomicLong
> > > > counter.
> > > > > > > This
> > > > > > > > is the closest to what we're proposing.
> > > > > > > > 2. AcquisitionLockTimeoutPerSec in Share Groups, which is a
> > Meter
> > > > > that
> > > > > > > > tracks the rate of acquisition lock timeouts.
> > > > > > > >
> > > > > > > > I think our KIP follows a similar approach to
> > > > > worker-poll-timeout-count
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Tony
> > > > > > > >
> > > > > > > > On Thu, May 7, 2026 at 5:21 PM Justine Olshan via dev <
> > > > > > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Tony,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. Overall, I think the idea makes sense
> and
> > > it
> > > > > > seems
> > > > > > > > like
> > > > > > > > > you and Kevin are getting closer to agreement on the exact
> > > > > definition
> > > > > > > of
> > > > > > > > > the metrics.
> > > > > > > > > I was curious, are there any other timeout metrics you can
> > find
> > > > in
> > > > > > > Kafka
> > > > > > > > > and how are those defined? We don't necessarily need to do
> > the
> > > > > same,
> > > > > > > but
> > > > > > > > > was curious if there was any precedent for this type of
> > metric.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Justine
> > > > > > > > >
> > > > > > > > > On Mon, May 4, 2026 at 9:25 AM Tony Tang via dev <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Kevin,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >
> > > > > > > > > > KW3: Ok, I agree that metric values at time X should be
> an
> > > > > accurate
> > > > > > > > > > snapshot of the node at time X, and that collecting
> > > historical
> > > > > > values
> > > > > > > > is
> > > > > > > > > > not the responsibility of kafka. The approach you
> described
> > > > makes
> > > > > > > sense
> > > > > > > > > to
> > > > > > > > > > me: we keep the internal counter across state
> transitions,
> > > but
> > > > > only
> > > > > > > > > > register/unregister the metric based on whether the
> > > underlying
> > > > > > timer
> > > > > > > > > > exists. I'm on board with that approach
> > > > > > > > > >
> > > > > > > > > > KW4: That's a great suggestion. However, I think it's out
> > of
> > > > > scope
> > > > > > > for
> > > > > > > > > this
> > > > > > > > > > KIP. I'd prefer to keep this KIP focused on timeout
> > > expiration
> > > > > > > metrics
> > > > > > > > > and
> > > > > > > > > > maybe we can discuss the state transition counting in the
> > > > future.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Tony
> > > > > > > > > >
> > > > > > > > > > On Fri, May 1, 2026 at 5:12 PM Kevin Wu <
> > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Tony,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > >
> > > > > > > > > > > KW3: I don't think "count" metrics like the ones being
> > > > > discussed
> > > > > > in
> > > > > > > > > this
> > > > > > > > > > > KIP should report values when the objects they are
> > > associated
> > > > > > with
> > > > > > > do
> > > > > > > > > not
> > > > > > > > > > > exist. This would mean metric values at time X are not
> an
> > > > > > accurate
> > > > > > > > > > > "snapshot" of the Kafka node at time X. In my opinion,
> > > > > collecting
> > > > > > > the
> > > > > > > > > > > historic values for a metric, visualizing them through
> > > > > > dashboards,
> > > > > > > > and
> > > > > > > > > > > monitoring them to alert operators are the
> > responsibilities
> > > > of
> > > > > a
> > > > > > > > > > downstream
> > > > > > > > > > > observability software, not Kafka. Kafka does have the
> > > > > capability
> > > > > > > to
> > > > > > > > > > create
> > > > > > > > > > > "derivative" metrics (i.e. metrics whose values are
> based
> > > off
> > > > > of
> > > > > > > > > sampling
> > > > > > > > > > > something else over time) via Sensors, but I don't
> think
> > > that
> > > > > > fits
> > > > > > > > our
> > > > > > > > > > use
> > > > > > > > > > > case as previously discussed.
> > > > > > > > > > >
> > > > > > > > > > > Another way to think about it is that adding or
> removing
> > a
> > > > > metric
> > > > > > > so
> > > > > > > > > > Kafka
> > > > > > > > > > > starts or stops reporting a value to an observability
> > > service
> > > > > > > > actually
> > > > > > > > > > > tells the user more information about Kafka compared to
> > > > > > > > unconditionally
> > > > > > > > > > > reporting said metric value. Additionally, just because
> > you
> > > > > > remove
> > > > > > > a
> > > > > > > > > > metric
> > > > > > > > > > > does not mean you need to remove the corresponding
> > counter
> > > > > within
> > > > > > > the
> > > > > > > > > > > KafkaRaftMetrics object. For example, a node fetch
> > request
> > > > > times
> > > > > > > out
> > > > > > > > 5
> > > > > > > > > > > times as a follower, so the node reports 5 for the
> > metric.
> > > > > Next,
> > > > > > > the
> > > > > > > > > node
> > > > > > > > > > > becomes the leader, so it stops reporting the metric
> > (i.e.
> > > > the
> > > > > > > metric
> > > > > > > > > has
> > > > > > > > > > > no value). Then it becomes a follower, and starts
> > > reporting 5
> > > > > > > again.
> > > > > > > > > When
> > > > > > > > > > > the next fetch timeout happens, the node reports 6 for
> > the
> > > > > > metric.
> > > > > > > > What
> > > > > > > > > > do
> > > > > > > > > > > you think about this behavior?
> > > > > > > > > > >
> > > > > > > > > > > KW4: If the desire is for a metric that always reports
> a
> > > > value,
> > > > > > > what
> > > > > > > > do
> > > > > > > > > > you
> > > > > > > > > > > think about a metric that counts the number of
> > `EpochState`
> > > > > > > > > transitions?
> > > > > > > > > > I
> > > > > > > > > > > think this value makes sense to report this value for
> the
> > > > > > lifetime
> > > > > > > > of a
> > > > > > > > > > > process, and generally, frequent state transitions are
> an
> > > > > > > indication
> > > > > > > > > > > something is wrong with the cluster. This would be an
> > > > > additional
> > > > > > > > metric
> > > > > > > > > > to
> > > > > > > > > > > the ones we discussed above.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Kevin Wu
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 1, 2026 at 12:59 PM Tony Tang via dev <
> > > > > > > > > [email protected]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Kevin,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the reply. Very insightful points.
> > > > > > > > > > > >
> > > > > > > > > > > > KW1: Yes, using a single tagged metric makes sense.
> > it's
> > > > > > cleaner
> > > > > > > > and
> > > > > > > > > > more
> > > > > > > > > > > > extensible. I'll adopt this approach.
> > > > > > > > > > > >
> > > > > > > > > > > > KW2: Yes, we don't need to use `CumulativeCount`.
> > Already
> > > > > > updated
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > KIP
> > > > > > > > > > > >
> > > > > > > > > > > > KW3: I understand each timer is only meaningful in
> > > certain
> > > > > > > states,
> > > > > > > > > but
> > > > > > > > > > > the
> > > > > > > > > > > > metric value is still useful for operational
> monitoring
> > > > > > > regardless
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > > current state. It tells you how many times a timeout
> > has
> > > > > > expired
> > > > > > > > over
> > > > > > > > > > the
> > > > > > > > > > > > lifetime of the node. Hiding or clearing the metric
> > when
> > > > the
> > > > > > node
> > > > > > > > > isn't
> > > > > > > > > > > in
> > > > > > > > > > > > the relevant state could actually make it harder for
> > > users
> > > > to
> > > > > > > > > diagnose
> > > > > > > > > > > > historical issues, since they'd need to catch the
> > metric
> > > > > while
> > > > > > > the
> > > > > > > > > node
> > > > > > > > > > > > happens to be in the right state. For example, if a
> > > > follower
> > > > > > had
> > > > > > > > > > repeated
> > > > > > > > > > > > fetch timeout expirations and then transitions to a
> > > > > > > > candidate/leader,
> > > > > > > > > > the
> > > > > > > > > > > > metrics would still be valuable for diagnosing why
> the
> > > > leader
> > > > > > > > > election
> > > > > > > > > > > > happened in the first place, right? If we cleared the
> > > > metric
> > > > > on
> > > > > > > > state
> > > > > > > > > > > > transition, that information would be lost. The
> > question
> > > > is :
> > > > > > Do
> > > > > > > we
> > > > > > > > > > only
> > > > > > > > > > > > want the metric to reflect only the latest state, or
> > the
> > > > > > overall
> > > > > > > > > > timeout
> > > > > > > > > > > > behavior over the node's lifetime? I lean toward the
> > > > latter,
> > > > > as
> > > > > > > it
> > > > > > > > > > > provides
> > > > > > > > > > > > more useful information for monitoring network
> issues.
> > To
> > > > > avoid
> > > > > > > > > > > confusion,
> > > > > > > > > > > > maybe we can use the metric name
> > lifetime-timeout-count +
> > > > tag
> > > > > > > > > > > > timer-name=fetch/election? What do you think?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Tony,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the KIP. I agree that having metrics for
> > > > > timeouts
> > > > > > in
> > > > > > > > > KRaft
> > > > > > > > > > > > would
> > > > > > > > > > > > > be a nice addition. I have a few high level
> comments
> > > > about
> > > > > > the
> > > > > > > > KIP:
> > > > > > > > > > > > >
> > > > > > > > > > > > > KW1: Did you consider making a tagged metric like
> > > > > > > > > > `number-of-timeouts`
> > > > > > > > > > > > > instead of individual metrics? You could tag by the
> > > timer
> > > > > > name
> > > > > > > > > (e.g.
> > > > > > > > > > > > fetch,
> > > > > > > > > > > > > election, update-voter, check-quorum, and
> > > > > begin-quorum-epoch
> > > > > > > > etc.)
> > > > > > > > > > > since
> > > > > > > > > > > > > KRaft supports several kinds of timers, and may add
> > > more
> > > > in
> > > > > > the
> > > > > > > > > > future.
> > > > > > > > > > > > You
> > > > > > > > > > > > > can look at `NodeMetrics.java` and
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$
> > > > > > > > > > > > > for an example of tagged metrics using Kafka's new
> > > > metrics
> > > > > > > > > library. I
> > > > > > > > > > > > think
> > > > > > > > > > > > > there is an argument we should add timeout metrics
> > for
> > > > some
> > > > > > of
> > > > > > > > > these
> > > > > > > > > > > > other
> > > > > > > > > > > > > KRaft timers I mentioned, since reporting them
> could
> > > also
> > > > > > help
> > > > > > > > > > > operators
> > > > > > > > > > > > > diagnose network partitions or possible software
> > bugs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > KW2: I see the "Type" for each metric is
> > > > > `CumulativeCount`. I
> > > > > > > > think
> > > > > > > > > > > this
> > > > > > > > > > > > > might be overkill, and that we could just use
> Integer
> > > for
> > > > > the
> > > > > > > > data
> > > > > > > > > > > type,
> > > > > > > > > > > > > and expose an increment method for each metric. In
> > > > general,
> > > > > > > > sensors
> > > > > > > > > > are
> > > > > > > > > > > > > used for when multiple metrics are associated with
> a
> > > > > specific
> > > > > > > > > concept
> > > > > > > > > > > > (e.g.
> > > > > > > > > > > > > `commit-latency-avg` and `commit-latency-max` are
> two
> > > > > > different
> > > > > > > > > > metrics
> > > > > > > > > > > > > associated with the same concept of "commit
> > latency").
> > > It
> > > > > is
> > > > > > > hard
> > > > > > > > > for
> > > > > > > > > > > me
> > > > > > > > > > > > to
> > > > > > > > > > > > > imagine that the number of timeouts occurring would
> > > have
> > > > > more
> > > > > > > > than
> > > > > > > > > > one
> > > > > > > > > > > > > metric associated with it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > KW3: Each of these timers is associated with an
> > > > EpochState
> > > > > > > (e.g.
> > > > > > > > > the
> > > > > > > > > > > > fetch
> > > > > > > > > > > > > timer with FollowerState, check quorum timer with
> > > > > > LeaderState,
> > > > > > > > > etc.).
> > > > > > > > > > > > What
> > > > > > > > > > > > > should the value of these metrics be when a node
> > > > > transitions
> > > > > > > > > between
> > > > > > > > > > > > > EpochStates? Should we stop reporting the metrics
> > > > > associated
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > old
> > > > > > > > > > > > > EpochState, and start reporting the metrics
> > associated
> > > > with
> > > > > > the
> > > > > > > > new
> > > > > > > > > > > > > EpochState? I personally think it might be
> confusing
> > if
> > > > > these
> > > > > > > > > metrics
> > > > > > > > > > > > > report values even if the underlying timer does not
> > > exist
> > > > > on
> > > > > > > the
> > > > > > > > > > node.
> > > > > > > > > > > > For
> > > > > > > > > > > > > example, the fetch timeout metric reporting a value
> > > when
> > > > > the
> > > > > > > > local
> > > > > > > > > > node
> > > > > > > > > > > > is
> > > > > > > > > > > > > the KRaft leader seems odd to me. When we added
> > metrics
> > > > for
> > > > > > > > KIP-853
> > > > > > > > > > > > > associated with the leader (e.g.
> > > > > `uncommitted-voter-change`),
> > > > > > > we
> > > > > > > > > > > decided
> > > > > > > > > > > > to
> > > > > > > > > > > > > only report values for those metrics when the local
> > > node
> > > > > was
> > > > > > > the
> > > > > > > > > > > leader.
> > > > > > > > > > > > It
> > > > > > > > > > > > > would be nice if we could follow that convention
> for
> > > > these
> > > > > > > > metrics
> > > > > > > > > > too,
> > > > > > > > > > > > and
> > > > > > > > > > > > > document which states report which metrics in the
> > KIP.
> > > > What
> > > > > > do
> > > > > > > > you
> > > > > > > > > > > think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Kevin Wu
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev
> <
> > > > > > > > > > > [email protected]
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello everyone,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'd like to start a discussion on KIP-1322: Add
> > > metrics
> > > > > to
> > > > > > > > Kraft
> > > > > > > > > > that
> > > > > > > > > > > > > > measure the number of fetch timeouts and election
> > > > > timeouts
> > > > > > <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This proposal aims to add new metrics to KRaft
> that
> > > > track
> > > > > > how
> > > > > > > > > often
> > > > > > > > > > > > fetch
> > > > > > > > > > > > > > timeouts and election timeouts occur.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Tony Tang
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Reply via email to