Hey Tony, Sorry for the delay on the response. Thanks for sharing the metrics we have in Kafka.
I think given the approaches we chose, it could be useful to include a section in the KIP for rejected alternatives with a summary of things we considered, but chose not to implement. That can be useful to folks who may not follow this discussion thread, but may have questions about certain design choices. Things like why gauge vs. meter and maybe some of the discussion above about whether to reset the values given the state. Thanks, Justine On Thu, May 14, 2026 at 10:55 AM Tony Tang via dev <[email protected]> wrote: > Hi Kevin, > > I think we are aligned that the metric should > > increment in value when its corresponding timer goes from unexpired to > > expired during a poll of the raft client. It should not increment in > value > > if the timer goes from expired to expired during a poll of the raft > client. > > > Yeah I agree with that. I'll add this point to the KIP > > Best, > Tony > > On Thu, May 14, 2026 at 12:09 PM Kevin Wu <[email protected]> wrote: > > > Hi Tony, > > > > Yeah we should always report these metrics. When/how to increment them is > > an implementation detail, but I think we are aligned that the metric > should > > increment in value when its corresponding timer goes from unexpired to > > expired during a poll of the raft client. It should not increment in > value > > if the timer goes from expired to expired during a poll of the raft > client. > > What do you think? > > > > Best, > > Kevin Wu > > > > On Wed, May 13, 2026 at 7:24 PM Tony Tang via dev <[email protected]> > > wrote: > > > > > Hi Kevin, > > > > > > Re KW3: That makes sense to me. In particular, if we only register the > > > election timeout metric during Unattached/Prospective/Candidate, those > > are > > > all short states. After a timeout the node changes status, and the > metric > > > gets unregistered. Since nodes spend most of their time in Follower or > > > Leader, operators may never see this metric at all. So yeah, let's just > > > always report these metrics. > > > > > > Re KW8: > > > > > > Implementing this like case 1 could diverge from the meaning of the > > metric > > > > in some cases. For example, expiring the fetch timeout for a voter > > always > > > > causes a transition to Prospective, so this metric is not updated on > > the > > > > next poll(). It also stops getting reported. > > > > > > Since we decide not registering/unregistering metrics based on state, > the > > > above concerns disappear, right? > > > > > > observers do not transition to another state > > > > when its fetch timeout expires, so every subsequent poll would > increase > > > > this metric's value, which is technically wrong based on what we > > defined > > > in > > > > the KIP. > > > > > > Okay, I understand the observer concern. Let's assume this KIP will be > > > implemented after KAFKA-20514, so the observer fetch timeout behavior > > will > > > already be fixed by then. > > > > > > In summary: > > > 1. we always report the metrics. > > > 2. we only increment the counter in the poll methods when isExpired() > is > > > called, as we originally planned. Does that work for you? > > > > > > Thanks, > > > Tony > > > > > > On Wed, May 13, 2026 at 12:10 PM Kevin Wu <[email protected]> > > wrote: > > > > > > > Hi Tony, > > > > > > > > KW3: Sorry for going back and forth on this, but one issue I see with > > not > > > > reporting these metrics when the node's current EpochState does not > > > contain > > > > the timer is that downstream observability platforms may not be able > to > > > > capture these values very well. For example, the fetch timeout can > only > > > > expire once on a FollowerAsVoter before it transitions to > Prospective, > > > and > > > > that value would not be reported/visible to operators. Only upon > > > returning > > > > to FollowerState would the operator see the fetch timeout metric > again, > > > > which cannot increment without first not reporting any value. That is > > not > > > > intuitive metric behavior to me, as the operator would see the value > > "too > > > > late". > > > > > > > > In summary, I think it is fine to always report these metrics, and if > > it > > > > turns out that is wrong or confusing, that's an implementation detail > > we > > > > can fix later. I think I was trying to draw too many parallels with > the > > > > KIP-853 leader metrics I implemented previously. Leader and Follower > > are > > > > "long-lived" states so I think it makes sense to report metrics which > > > only > > > > change in these states, during these states. Timer expirations > > > specifically > > > > cause transitions to and between more "intermittent" states (e.g. > > > > Prospective, Candidate, etc.), and therefore occur at the "end" of an > > > > EpochState, so registering and unregistering them does not seem like > a > > > good > > > > idea. > > > > > > > > KW8: Yeah, basically. Implementing this like case 1 could diverge > from > > > the > > > > meaning of the metric in some cases. For example, expiring the fetch > > > > timeout for a voter always causes a transition to Prospective, so > this > > > > metric is not updated on the next poll(). It also stops getting > > reported. > > > > However, in the current code, observers do not transition to another > > > state > > > > when its fetch timeout expires, so every subsequent poll would > increase > > > > this metric's value, which is technically wrong based on what we > > defined > > > in > > > > the KIP. > > > > > > > > > > https://urldefense.com/v3/__https://issues.apache.org/jira/browse/KAFKA-20514__;!!Ayb5sqE7!qUcUv6MZLg9t1tNcXiCJCDe1vw-YajXBZeS9xD-Un3niEylSMbS3gHZyccO2mgQha4feqEvYJSE5sbQTnAhaooA$ > > > > fixes the issue > > > > for the fetch timer specifically. > > > > > > > > This comment specifically is just something I wanted to make you > aware > > of > > > > during your implementation, since it was not super obvious to me that > > > this > > > > interaction with poll(), EpochState transitions, and our proposed > > metrics > > > > occurred until a deeper look at the code. > > > > > > > > Best, > > > > Kevin Wu > > > > > > > > On Wed, May 13, 2026 at 8:51 AM Tony Tang via dev < > > [email protected]> > > > > wrote: > > > > > > > > > Hi Kevin, > > > > > > > > > > Thanks for the feedback. > > > > > Re KW 5,6,7: I agree with all three. I'll update the KIP > > > > > Re KW 8: > > > > > Case 1: The the poll methods in KafkaRaftClient actively calls > > > > isExpired() > > > > > on the timer, and when it finds the timer has expired, it > increments > > > the > > > > > counter. The counting logic lives in the caller > > > > > Case 2: The timer itself is aware of its expiration and > automatically > > > > > increments the counter when it transitions from not expired to > > expired. > > > > The > > > > > counting logic lives inside the timer. > > > > > Is that the cases you're talking about? > > > > > > > > > > Best, > > > > > Tony > > > > > > > > > > > > > > > On Tue, May 12, 2026 at 5:42 PM Kevin Wu <[email protected]> > > > wrote: > > > > > > > > > > > Hi Tony, > > > > > > > > > > > > Thanks for the updates to the KIP. I have a few minor comments: > > > > > > > > > > > > KW5: nit, but can we say “this metric is registered only…” > instead > > of > > > > > > “Registered only…” in the description column? > > > > > > > > > > > > KW6: What do you think about renaming the metric to > > > > > “timeout-expirations”? > > > > > > This would be slightly less verbose than what we have now. > > > > > > > > > > > > KW7: Instead of having separate columns for the metric tag and > > name, > > > > can > > > > > we > > > > > > have one column with the entire MBean name? For example: > > > > > > > > > > > > > > > > > > > > > > > > > > > “kafka.server:type=node-metrics,name=maximum-supported-level,feature-name=X”. > > > > > > > > > > > > KW8: Something to consider in determining when these metric > values > > > > should > > > > > > be incremented: should the value be incremented when the raft > > client > > > > > reads > > > > > > the timer and finds it has expired? Or, should the value be > > > incremented > > > > > > only when the timer goes from not expired to expired? The first > is > > > > simple > > > > > > but means the metric value is slightly different than what we’ve > > > > > described > > > > > > in the KIP. I feel like implementing the second may be pretty > > > complex… > > > > > > > > > > > > Best, > > > > > > Kevin Wu > > > > > > > > > > > > On Fri, May 8, 2026 at 10:06 AM Tony Tang via dev < > > > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > Hi Justine, > > > > > > > > > > > > > > Thanks for the feedback! > > > > > > > I looked through the Kafka codebase and found a couple of > > existing > > > > > > timeout > > > > > > > metrics: > > > > > > > > > > > > > > 1.worker-poll-timeout-count in Kafka Connect, which is a > > cumulative > > > > > count > > > > > > > of poll timeouts, implemented as a Gauge using an AtomicLong > > > counter. > > > > > > This > > > > > > > is the closest to what we're proposing. > > > > > > > 2. AcquisitionLockTimeoutPerSec in Share Groups, which is a > Meter > > > > that > > > > > > > tracks the rate of acquisition lock timeouts. > > > > > > > > > > > > > > I think our KIP follows a similar approach to > > > > worker-poll-timeout-count > > > > > > > > > > > > > > Thanks, > > > > > > > Tony > > > > > > > > > > > > > > On Thu, May 7, 2026 at 5:21 PM Justine Olshan via dev < > > > > > > > [email protected]> > > > > > > > wrote: > > > > > > > > > > > > > > > Hey Tony, > > > > > > > > > > > > > > > > Thanks for the KIP. Overall, I think the idea makes sense and > > it > > > > > seems > > > > > > > like > > > > > > > > you and Kevin are getting closer to agreement on the exact > > > > definition > > > > > > of > > > > > > > > the metrics. > > > > > > > > I was curious, are there any other timeout metrics you can > find > > > in > > > > > > Kafka > > > > > > > > and how are those defined? We don't necessarily need to do > the > > > > same, > > > > > > but > > > > > > > > was curious if there was any precedent for this type of > metric. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Justine > > > > > > > > > > > > > > > > On Mon, May 4, 2026 at 9:25 AM Tony Tang via dev < > > > > > [email protected] > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Kevin, > > > > > > > > > > > > > > > > > > Thanks for the reply. > > > > > > > > > > > > > > > > > > KW3: Ok, I agree that metric values at time X should be an > > > > accurate > > > > > > > > > snapshot of the node at time X, and that collecting > > historical > > > > > values > > > > > > > is > > > > > > > > > not the responsibility of kafka. The approach you described > > > makes > > > > > > sense > > > > > > > > to > > > > > > > > > me: we keep the internal counter across state transitions, > > but > > > > only > > > > > > > > > register/unregister the metric based on whether the > > underlying > > > > > timer > > > > > > > > > exists. I'm on board with that approach > > > > > > > > > > > > > > > > > > KW4: That's a great suggestion. However, I think it's out > of > > > > scope > > > > > > for > > > > > > > > this > > > > > > > > > KIP. I'd prefer to keep this KIP focused on timeout > > expiration > > > > > > metrics > > > > > > > > and > > > > > > > > > maybe we can discuss the state transition counting in the > > > future. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Tony > > > > > > > > > > > > > > > > > > On Fri, May 1, 2026 at 5:12 PM Kevin Wu < > > > [email protected]> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Tony, > > > > > > > > > > > > > > > > > > > > Thanks for the reply. > > > > > > > > > > > > > > > > > > > > KW3: I don't think "count" metrics like the ones being > > > > discussed > > > > > in > > > > > > > > this > > > > > > > > > > KIP should report values when the objects they are > > associated > > > > > with > > > > > > do > > > > > > > > not > > > > > > > > > > exist. This would mean metric values at time X are not an > > > > > accurate > > > > > > > > > > "snapshot" of the Kafka node at time X. In my opinion, > > > > collecting > > > > > > the > > > > > > > > > > historic values for a metric, visualizing them through > > > > > dashboards, > > > > > > > and > > > > > > > > > > monitoring them to alert operators are the > responsibilities > > > of > > > > a > > > > > > > > > downstream > > > > > > > > > > observability software, not Kafka. Kafka does have the > > > > capability > > > > > > to > > > > > > > > > create > > > > > > > > > > "derivative" metrics (i.e. metrics whose values are based > > off > > > > of > > > > > > > > sampling > > > > > > > > > > something else over time) via Sensors, but I don't think > > that > > > > > fits > > > > > > > our > > > > > > > > > use > > > > > > > > > > case as previously discussed. > > > > > > > > > > > > > > > > > > > > Another way to think about it is that adding or removing > a > > > > metric > > > > > > so > > > > > > > > > Kafka > > > > > > > > > > starts or stops reporting a value to an observability > > service > > > > > > > actually > > > > > > > > > > tells the user more information about Kafka compared to > > > > > > > unconditionally > > > > > > > > > > reporting said metric value. Additionally, just because > you > > > > > remove > > > > > > a > > > > > > > > > metric > > > > > > > > > > does not mean you need to remove the corresponding > counter > > > > within > > > > > > the > > > > > > > > > > KafkaRaftMetrics object. For example, a node fetch > request > > > > times > > > > > > out > > > > > > > 5 > > > > > > > > > > times as a follower, so the node reports 5 for the > metric. > > > > Next, > > > > > > the > > > > > > > > node > > > > > > > > > > becomes the leader, so it stops reporting the metric > (i.e. > > > the > > > > > > metric > > > > > > > > has > > > > > > > > > > no value). Then it becomes a follower, and starts > > reporting 5 > > > > > > again. > > > > > > > > When > > > > > > > > > > the next fetch timeout happens, the node reports 6 for > the > > > > > metric. > > > > > > > What > > > > > > > > > do > > > > > > > > > > you think about this behavior? > > > > > > > > > > > > > > > > > > > > KW4: If the desire is for a metric that always reports a > > > value, > > > > > > what > > > > > > > do > > > > > > > > > you > > > > > > > > > > think about a metric that counts the number of > `EpochState` > > > > > > > > transitions? > > > > > > > > > I > > > > > > > > > > think this value makes sense to report this value for the > > > > > lifetime > > > > > > > of a > > > > > > > > > > process, and generally, frequent state transitions are an > > > > > > indication > > > > > > > > > > something is wrong with the cluster. This would be an > > > > additional > > > > > > > metric > > > > > > > > > to > > > > > > > > > > the ones we discussed above. > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Kevin Wu > > > > > > > > > > > > > > > > > > > > On Fri, May 1, 2026 at 12:59 PM Tony Tang via dev < > > > > > > > > [email protected]> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi Kevin, > > > > > > > > > > > > > > > > > > > > > > Thanks for the reply. Very insightful points. > > > > > > > > > > > > > > > > > > > > > > KW1: Yes, using a single tagged metric makes sense. > it's > > > > > cleaner > > > > > > > and > > > > > > > > > more > > > > > > > > > > > extensible. I'll adopt this approach. > > > > > > > > > > > > > > > > > > > > > > KW2: Yes, we don't need to use `CumulativeCount`. > Already > > > > > updated > > > > > > > in > > > > > > > > > the > > > > > > > > > > > KIP > > > > > > > > > > > > > > > > > > > > > > KW3: I understand each timer is only meaningful in > > certain > > > > > > states, > > > > > > > > but > > > > > > > > > > the > > > > > > > > > > > metric value is still useful for operational monitoring > > > > > > regardless > > > > > > > of > > > > > > > > > the > > > > > > > > > > > current state. It tells you how many times a timeout > has > > > > > expired > > > > > > > over > > > > > > > > > the > > > > > > > > > > > lifetime of the node. Hiding or clearing the metric > when > > > the > > > > > node > > > > > > > > isn't > > > > > > > > > > in > > > > > > > > > > > the relevant state could actually make it harder for > > users > > > to > > > > > > > > diagnose > > > > > > > > > > > historical issues, since they'd need to catch the > metric > > > > while > > > > > > the > > > > > > > > node > > > > > > > > > > > happens to be in the right state. For example, if a > > > follower > > > > > had > > > > > > > > > repeated > > > > > > > > > > > fetch timeout expirations and then transitions to a > > > > > > > candidate/leader, > > > > > > > > > the > > > > > > > > > > > metrics would still be valuable for diagnosing why the > > > leader > > > > > > > > election > > > > > > > > > > > happened in the first place, right? If we cleared the > > > metric > > > > on > > > > > > > state > > > > > > > > > > > transition, that information would be lost. The > question > > > is : > > > > > Do > > > > > > we > > > > > > > > > only > > > > > > > > > > > want the metric to reflect only the latest state, or > the > > > > > overall > > > > > > > > > timeout > > > > > > > > > > > behavior over the node's lifetime? I lean toward the > > > latter, > > > > as > > > > > > it > > > > > > > > > > provides > > > > > > > > > > > more useful information for monitoring network issues. > To > > > > avoid > > > > > > > > > > confusion, > > > > > > > > > > > maybe we can use the metric name > lifetime-timeout-count + > > > tag > > > > > > > > > > > timer-name=fetch/election? What do you think? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu < > > > > > [email protected] > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Tony, > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP. I agree that having metrics for > > > > timeouts > > > > > in > > > > > > > > KRaft > > > > > > > > > > > would > > > > > > > > > > > > be a nice addition. I have a few high level comments > > > about > > > > > the > > > > > > > KIP: > > > > > > > > > > > > > > > > > > > > > > > > KW1: Did you consider making a tagged metric like > > > > > > > > > `number-of-timeouts` > > > > > > > > > > > > instead of individual metrics? You could tag by the > > timer > > > > > name > > > > > > > > (e.g. > > > > > > > > > > > fetch, > > > > > > > > > > > > election, update-voter, check-quorum, and > > > > begin-quorum-epoch > > > > > > > etc.) > > > > > > > > > > since > > > > > > > > > > > > KRaft supports several kinds of timers, and may add > > more > > > in > > > > > the > > > > > > > > > future. > > > > > > > > > > > You > > > > > > > > > > > > can look at `NodeMetrics.java` and > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$ > > > > > > > > > > > > for an example of tagged metrics using Kafka's new > > > metrics > > > > > > > > library. I > > > > > > > > > > > think > > > > > > > > > > > > there is an argument we should add timeout metrics > for > > > some > > > > > of > > > > > > > > these > > > > > > > > > > > other > > > > > > > > > > > > KRaft timers I mentioned, since reporting them could > > also > > > > > help > > > > > > > > > > operators > > > > > > > > > > > > diagnose network partitions or possible software > bugs. > > > > > > > > > > > > > > > > > > > > > > > > KW2: I see the "Type" for each metric is > > > > `CumulativeCount`. I > > > > > > > think > > > > > > > > > > this > > > > > > > > > > > > might be overkill, and that we could just use Integer > > for > > > > the > > > > > > > data > > > > > > > > > > type, > > > > > > > > > > > > and expose an increment method for each metric. In > > > general, > > > > > > > sensors > > > > > > > > > are > > > > > > > > > > > > used for when multiple metrics are associated with a > > > > specific > > > > > > > > concept > > > > > > > > > > > (e.g. > > > > > > > > > > > > `commit-latency-avg` and `commit-latency-max` are two > > > > > different > > > > > > > > > metrics > > > > > > > > > > > > associated with the same concept of "commit > latency"). > > It > > > > is > > > > > > hard > > > > > > > > for > > > > > > > > > > me > > > > > > > > > > > to > > > > > > > > > > > > imagine that the number of timeouts occurring would > > have > > > > more > > > > > > > than > > > > > > > > > one > > > > > > > > > > > > metric associated with it. > > > > > > > > > > > > > > > > > > > > > > > > KW3: Each of these timers is associated with an > > > EpochState > > > > > > (e.g. > > > > > > > > the > > > > > > > > > > > fetch > > > > > > > > > > > > timer with FollowerState, check quorum timer with > > > > > LeaderState, > > > > > > > > etc.). > > > > > > > > > > > What > > > > > > > > > > > > should the value of these metrics be when a node > > > > transitions > > > > > > > > between > > > > > > > > > > > > EpochStates? Should we stop reporting the metrics > > > > associated > > > > > > with > > > > > > > > the > > > > > > > > > > old > > > > > > > > > > > > EpochState, and start reporting the metrics > associated > > > with > > > > > the > > > > > > > new > > > > > > > > > > > > EpochState? I personally think it might be confusing > if > > > > these > > > > > > > > metrics > > > > > > > > > > > > report values even if the underlying timer does not > > exist > > > > on > > > > > > the > > > > > > > > > node. > > > > > > > > > > > For > > > > > > > > > > > > example, the fetch timeout metric reporting a value > > when > > > > the > > > > > > > local > > > > > > > > > node > > > > > > > > > > > is > > > > > > > > > > > > the KRaft leader seems odd to me. When we added > metrics > > > for > > > > > > > KIP-853 > > > > > > > > > > > > associated with the leader (e.g. > > > > `uncommitted-voter-change`), > > > > > > we > > > > > > > > > > decided > > > > > > > > > > > to > > > > > > > > > > > > only report values for those metrics when the local > > node > > > > was > > > > > > the > > > > > > > > > > leader. > > > > > > > > > > > It > > > > > > > > > > > > would be nice if we could follow that convention for > > > these > > > > > > > metrics > > > > > > > > > too, > > > > > > > > > > > and > > > > > > > > > > > > document which states report which metrics in the > KIP. > > > What > > > > > do > > > > > > > you > > > > > > > > > > think? > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > Kevin Wu > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev < > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hello everyone, > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to start a discussion on KIP-1322: Add > > metrics > > > > to > > > > > > > Kraft > > > > > > > > > that > > > > > > > > > > > > > measure the number of fetch timeouts and election > > > > timeouts > > > > > < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This proposal aims to add new metrics to KRaft that > > > track > > > > > how > > > > > > > > often > > > > > > > > > > > fetch > > > > > > > > > > > > > timeouts and election timeouts occur. > > > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > > Tony Tang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
