Thanks for the KIP, Travis.
The 3 new latency metrics sound very useful.
For the 4 new DEBUG metrics: they sound somewhat redundant with the
existing metrics:
- task-created-rate
- task-created-total
- task-closed-rate
- task-closed-total
Are you aware that these metrics already exist? I don't see why they
would not work when the "streams" rebalance protocol is used.
Btw: I have always wondered about the usefulness of the two `-total`
metrics. How is it useful to know, for a long-running application, how
many tasks got created or closed over the whole lifetime of a
StreamThread?
It could be useful, though, to know the number of created/revoked/lost
tasks from the last rebalance, i.e., we would use a gauge instead of a
cumulative sum metric?
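To illustrate what I have in mind, here is a rough sketch using Kafka's
`Gauge` (the metric name is made up, not a concrete proposal):

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Gauge;
import org.apache.kafka.common.metrics.Metrics;

public class LastRebalanceGaugeSketch {
    // Updated by the rebalance callback; read lazily when metrics are polled.
    static final AtomicInteger tasksCreatedInLastRebalance = new AtomicInteger();

    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        MetricName name = metrics.metricName(
            "last-rebalance-tasks-created",    // made-up metric name
            "stream-thread-metrics",
            "Number of tasks created during the last rebalance");
        // A gauge reports the current value instead of accumulating a total.
        metrics.addMetric(name, (Gauge<Integer>) (config, now) ->
            tasksCreatedInLastRebalance.get());
    }
}

The callback would overwrite the value on every rebalance, so the metric
always reflects the last rebalance instead of growing forever like the
`-total` metrics.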
Splitting out active/standby/warmup as proposed by Lucas sounds useful,
too. So maybe we could deprecate the existing metrics and replace them
with better ones?
What is the reason to split out active/standby (and warmup) for the
"assigned" case, but not the revoked or lost case?
Personally, I don't think we should add task IDs to metrics. If users
need access to this information, it might be better to add some
callback/listener they can register on `KafkaStreams` -- but even for
this, I am not sure how useful it would be. Did any user report that
it would be useful?
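Just to show the shape of what I mean -- all names below are purely made
up, this is not a concrete proposal:

import java.util.Set;
import org.apache.kafka.streams.processor.TaskId;

// Hypothetical listener -- no such interface exists today.
public interface TaskAssignmentListener {
    void onTasksAssigned(Set<TaskId> activeTasks, Set<TaskId> standbyTasks);
    void onTasksRevoked(Set<TaskId> revokedTasks);
    void onTasksLost(Set<TaskId> lostTasks);
}

// Hypothetical registration point:
// kafkaStreams.setTaskAssignmentListener(myListener);

But as said, I am not convinced we need it at all.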
-Matthias
On 9/16/25 2:14 AM, Lucas Brutschy wrote:
Hi Travis,
thanks for the KIP!
Looks good to me. I'm not sure we need the DEBUG metrics, but we can
add them. However, if you are including active/standby, I would also
include warm-up tasks in the metrics right away. Furthermore, I wasn't
sure whether Bill wanted to add the number of tasks or the actual task
IDs to the DEBUG metrics. Bill, maybe you can comment on that.
I think after hashing out these finer points about the DEBUG metrics,
we can open a vote thread.
Cheers,
Lucas
On Mon, Sep 15, 2025 at 6:38 AM Travis Zhang <[email protected]> wrote:
Hi Bill,
Thanks for your feedback. It does make sense to me. I've added the
total task count metrics to the KIP at DEBUG level!
Best,
Travis
On Fri, Sep 12, 2025 at 11:44 AM Bill Bejeck <[email protected]> wrote:
Hi Travis,
Thanks for the KIP! It looks like a useful addition in support of KIP-1017.
Overall the KIP LGTM, but I have a follow-up question.
Would we want to consider an additional metric displaying the tasks
involved in each of the revoked, assigned, and lost events?
This would probably be best at the DEBUG level.
Certainly this is an optional suggestion, but I do feel it would be a
valuable aid to operators of KS applications.
Regards,
Bill
On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <[email protected]> wrote:
Hey Alieh,
Thanks for the great questions and the thoughtful feedback on the KIP!
Good call on adding the code snippets -- I'll get the key class
structures into the KIP to make it fully self-contained.
You raised some excellent points on the metrics strategy. Here’s my
thinking on them:
1. Why Thread-Level Metrics:
We opted for thread-level reporting for two main reasons:
debuggability and consistency. When a rebalance gets stuck, operators
need to pinpoint exactly which StreamThread is the bottleneck, as each
one can have a very different workload. This approach also aligns with
all other core metrics (like `process-latency`), which are already
scoped to the thread level.
While it is possible to add application-level aggregates, they
wouldn't offer new insights since any application-wide issue will
always show up in one or more threads. I felt this gives operators the
most diagnostic power without adding noise.
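To make this concrete, here is a rough sketch of how an operator could
drill down to a single thread with the existing client metrics API (the
metric name passed in is a placeholder; the group and tag names follow
the current thread-level convention):

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class PerThreadMetricsSketch {
    // Print one metric's value for every StreamThread of the application.
    static void printPerThread(final KafkaStreams streams, final String metric) {
        for (final Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            final MetricName n = e.getKey();
            if (n.name().equals(metric) && "stream-thread-metrics".equals(n.group())) {
                // The thread-id tag identifies which StreamThread reported the value.
                System.out.printf("%s -> %s%n",
                    n.tags().get("thread-id"), e.getValue().metricValue());
            }
        }
    }
}

An application-level aggregate would collapse exactly the per-thread
signal this loop exposes.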
2. Avg/Max vs. Percentiles:
On using avg/max: I think it is good enough for now, mainly because of
the nature of rebalances. They're infrequent but high-impact events.
Unlike a constant stream of processing operations, a single slow
rebalance is itself the production issue, making max latency the most
critical signal for an operator.
Percentiles are less statistically meaningful for such low-frequency
events and introduce a memory overhead we'd like to avoid initially.
We can definitely consider adding percentiles in a future KIP if we
find avg/max isn't sufficient once this is in production.
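For reference, avg/max map directly onto Kafka's constant-memory sensor
stats; a simplified sketch (metric names are placeholders, not the final
ones from the KIP):

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;

public class AvgMaxSketch {
    public static void main(String[] args) {
        final Metrics metrics = new Metrics();
        final Sensor sensor = metrics.sensor("rebalance-listener-latency");
        // Avg and Max keep O(1) state per sensor; a Percentiles stat would
        // need pre-sized histogram buckets -- the memory overhead mentioned above.
        sensor.add(metrics.metricName("onPartitionsAssigned-latency-avg",
            "stream-thread-metrics"), new Avg());
        sensor.add(metrics.metricName("onPartitionsAssigned-latency-max",
            "stream-thread-metrics"), new Max());
        sensor.record(42.0);  // record the measured callback duration in ms
    }
}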
Let me know if this reasoning makes sense. Happy to discuss it more!
Best,
Travis
On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi <[email protected]> wrote:
Hey Travis
Thanks for sharing the KIP.
One suggestion (not essential): would it be possible to include the
relevant code snippets and the new class directly in the KIP, in the
`Proposed Changes` section? That way, everything is self-contained and
there’s no need to switch between the KIP and the codebase.
I understand that you’re incorporating the existing metrics from the old
protocol into the new one, with the goal of maintaining consistency in
the metrics provided. However, I still have a few questions that might
be best addressed here, as this seems like the ideal time to raise them
and reconsider our approach.
1. Why are the new metrics being recorded at the thread level
exclusively? Would there be value in exposing these metrics at
additional levels (such as application), especially for operators
managing large topologies?

2. Are the chosen latency metrics (average and max) sufficient for
diagnosing issues in production, or should more granular statistics
(e.g., percentile latencies) be considered to improve observability?
Let me know your thoughts!
Best,
Alieh
On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang <[email protected]> wrote:
Hi,
I'd like to start a discussion on KIP-1216:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
This KIP proposes adding latency metrics for each rebalance callback
to provide operators with the observability needed to effectively
monitor and optimize Kafka Streams applications in production
environments.
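To give a rough idea of the mechanism: conceptually, each
`ConsumerRebalanceListener` callback gets timed and the duration is fed
into a latency sensor. A simplified sketch, not the actual
implementation:

import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class TimedRebalanceListener implements ConsumerRebalanceListener {
    private final ConsumerRebalanceListener delegate;

    public TimedRebalanceListener(final ConsumerRebalanceListener delegate) {
        this.delegate = delegate;
    }

    @Override
    public void onPartitionsAssigned(final Collection<TopicPartition> partitions) {
        final long start = System.nanoTime();
        delegate.onPartitionsAssigned(partitions);
        record("onPartitionsAssigned", System.nanoTime() - start);
    }

    @Override
    public void onPartitionsRevoked(final Collection<TopicPartition> partitions) {
        final long start = System.nanoTime();
        delegate.onPartitionsRevoked(partitions);
        record("onPartitionsRevoked", System.nanoTime() - start);
    }

    @Override
    public void onPartitionsLost(final Collection<TopicPartition> partitions) {
        final long start = System.nanoTime();
        delegate.onPartitionsLost(partitions);
        record("onPartitionsLost", System.nanoTime() - start);
    }

    private void record(final String callback, final long nanos) {
        // In the real implementation this would feed a latency sensor.
        System.out.printf("%s took %d ms%n", callback, nanos / 1_000_000);
    }
}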
Thanks,
Travis