Hi Bill, Lucas, Matthias,

Thanks a lot for reviewing and discussing the KIP. Based on our discussion, I've reverted the KIP to the version that focuses only on the latency metrics.
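For anyone skimming the thread, here is a rough sketch of the kind of measurement the KIP is after: timing each rebalance callback and recording the result into an avg/max sensor. This is illustrative only -- the class, group, and metric names below are my own shorthand, not the exact names in the KIP:

    import java.util.Collection;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Avg;
    import org.apache.kafka.common.metrics.stats.Max;

    // Illustrative sketch, not the KIP's implementation: time each rebalance
    // callback and record the latency (ms) into an avg/max sensor, scoped to
    // the thread as discussed below.
    public class TimedRebalanceListener implements ConsumerRebalanceListener {

        private final Sensor revokedLatency;
        private final Sensor assignedLatency;

        public TimedRebalanceListener(Metrics metrics) {
            revokedLatency = sensorWithAvgMax(metrics, "tasks-revoked-latency");
            assignedLatency = sensorWithAvgMax(metrics, "tasks-assigned-latency");
        }

        // Registers <name>-avg and <name>-max metrics backed by one sensor.
        private static Sensor sensorWithAvgMax(Metrics metrics, String name) {
            Sensor sensor = metrics.sensor(name);
            sensor.add(metrics.metricName(name + "-avg", "stream-thread-metrics"), new Avg());
            sensor.add(metrics.metricName(name + "-max", "stream-thread-metrics"), new Max());
            return sensor;
        }

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            long start = System.nanoTime();
            // ... suspend/close the tasks for the revoked partitions ...
            revokedLatency.record((System.nanoTime() - start) / 1_000_000.0); // ms
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            long start = System.nanoTime();
            // ... create/resume the tasks for the assigned partitions ...
            assignedLatency.record((System.nanoTime() - start) / 1_000_000.0); // ms
        }
    }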
On Tue, Sep 30, 2025 at 9:45 AM Bill Bejeck <[email protected]> wrote:
>
> Hey Lucas,
>
> Thinking about the original intent of the KIP, I agree it would be better to drop the DEBUG metrics suggestion and have it as follow-on work.
>
> -Bill
>
> On Tue, Sep 30, 2025 at 4:01 AM Lucas Brutschy <[email protected]> wrote:
> >
> > Hi Matthias / Bill,
> >
> > it's a good point that there is an overlap between the debug metrics and task-created-rate/task-created-total.
> >
> > I wonder if we are overloading this KIP with the DEBUG metrics that Bill suggested. The main point of the KIP is to capture the latency of revoking tasks and handling assignment of tasks. Understanding the latency of revoking / assigning tasks inside streams is important to understand the duration of rebalances.
> >
> > Capturing number of tasks assigned / revoked / lost, deprecating task-closed/task-closed-total metrics etc., and potentially including revoked / added tasks IDs into metrics, all sound useful but a bit orthogonal to the original point of the KIP. Could we leave this to future work?
> >
> > Cheers,
> > Lucas
> >
> > On Tue, Sep 30, 2025 at 2:02 AM Kirk True <[email protected]> wrote:
> > >
> > > Hi Travis,
> > >
> > > Thanks for the KIP!
> > >
> > > No comments on the KIP, per se, but I'm glad I read it because I don't remember ever hearing about metrics recording levels before. I'll definitely plan to put those into use for some of the metrics we added recently to the consumer.
> > >
> > > Thanks,
> > > Kirk
> > >
> > > On Mon, Sep 29, 2025, at 11:24 AM, Matthias J. Sax wrote:
> > > > Thanks for the KIP Travis.
> > > >
> > > > The 3 new latency metrics sound very useful.
> > > >
> > > > For the 4 new debug metrics: they sound somewhat redundant to existing metrics:
> > > > - task-created-rate
> > > > - task-created-total
> > > > - task-closed-rate
> > > > - task-closed-total
> > > >
> > > > Are you aware that these metrics already exist? I don't see, why they would not work if "streams" rebalance protocol gets used?
> > > >
> > > > Btw: I was always wondering about the usefulness of the two `-total` metrics? How is it useful to know, for a long running application, how many tasks got created or closed during the whole lifetime of a StreamsThread?
> > > >
> > > > It could be useful though, to know the number of created/revoked/lost task of the last rebalance, ie, we would use a gauge instead of a sum metric?
> > > >
> > > > Splitting out active/standby/warmup as proposed by Lucas sounds useful, too. So maybe we could deprecate the existing metrics, and replace with better one?
> > > >
> > > > What is the reason to split out active/standby (and warmup) for the "assigned" case, but not the revoked or lost case?
> > > >
> > > > I don't think we should add task-ids to metrics personally. If users need to access this information, it might be better to add some callback/listener they can register on `KafkaStreams` -- but even for this, I am not sure how useful it would be? Was any user reporting that it would be useful?
> > > >
> > > > -Matthias
> > > >
> > > > On 9/16/25 2:14 AM, Lucas Brutschy wrote:
> > > > > Hi Travis,
> > > > >
> > > > > thanks for the KIP!
> > > > >
> > > > > Looks good to me. I'm not sure, we need the DEBUG metrics, but we can add them. I would, however, already also include warm-up tasks in the metrics, if you are including active / standby. Furthermore, I also wasn't sure if Bill wanted to add the number of tasks or the actual task IDs to the DEBUG metrics. Bill, maybe you can comment on that.
> > > > >
> > > > > I think after hashing out these finer points about the DEBUG metrics, we can open a vote thread.
> > > > >
> > > > > Cheers,
> > > > > Lucas
> > > > >
> > > > > On Mon, Sep 15, 2025 at 6:38 AM Travis Zhang <[email protected]> wrote:
> > > > > >
> > > > > > Hi Bill,
> > > > > >
> > > > > > Thanks for your feedback. It does make sense to me. I've added the total task count metrics to the KIP at DEBUG level!
> > > > > >
> > > > > > Best,
> > > > > > Travis
> > > > > >
> > > > > > On Fri, Sep 12, 2025 at 11:44 AM Bill Bejeck <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Travis,
> > > > > > >
> > > > > > > Thanks for the KIP! It looks like a useful addition in support of KIP-1017. Overall the KIP LGTM, but I have a follow-up question.
> > > > > > >
> > > > > > > Would we want to consider an additional metric displaying the tasks involved in each of the revoked, assigned, and lost events? This would probably be best at the DEBUG level. Certainly this is an optional suggestion, but I do feel it would be a valuable aid to operators of KS applications.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Bill
> > > > > > >
> > > > > > > On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hey Alieh,
> > > > > > > >
> > > > > > > > Thanks for the great questions and the thoughtful feedback on the KIP!
> > > > > > > >
> > > > > > > > Good call on adding the code snippets—I'll get the key class structures into the KIP to make it fully self-contained.
> > > > > > > >
> > > > > > > > You raised some excellent points on the metrics strategy. Here’s my thinking on them:
> > > > > > > >
> > > > > > > > 1. Why Thread-Level Metrics:
> > > > > > > >
> > > > > > > > We opted for thread-level reporting for two main reasons: debuggability and consistency. When a rebalance gets stuck, operators need to pinpoint exactly which StreamThread is the bottleneck, as each one can have a very different workload. This approach also aligns with all other core metrics (like process-latency), which are already scoped to the thread.
> > > > > > > >
> > > > > > > > While it is possible to add application-level aggregates, they wouldn't offer new insights since any application-wide issue will always show up in one or more threads. I felt this gives operators the most diagnostic power without adding noise.
> > > > > > > >
> > > > > > > > 2. Avg/Max vs. Percentiles:
> > > > > > > >
> > > > > > > > On using avg/max, I think avg/max is good for now, mainly because of the nature of rebalances. They're infrequent but high-impact events. Unlike a constant stream of processing operations, a single slow rebalance is the production issue, making max latency the most critical signal for an operator.
> > > > > > > >
> > > > > > > > Percentiles are less statistically meaningful for such low-frequency events and introduce a memory overhead we'd like to avoid initially.
> > > > > > > >
> > > > > > > > We can definitely consider adding percentiles in a future KIP if we find avg/max isn't sufficient once this is in production.
> > > > > > > >
> > > > > > > > Let me know if this reasoning makes sense. Happy to discuss it more!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Travis
> > > > > > > >
> > > > > > > > On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hey Travis
> > > > > > > > >
> > > > > > > > > Thanks for sharing the KIP.
> > > > > > > > >
> > > > > > > > > One suggestion (not essential): would it be possible to include the relevant code snippet and the new class directly in the KIP in `Proposed Changes` section? That way, everything is self-contained and there’s no need to switch between the KIP and the codebase.
> > > > > > > > >
> > > > > > > > > I understand that you’re incorporating the existing metrics from the old protocol into the new one, with the goal of maintaining consistency in the metrics provided. However, I still have a few questions that might be best addressed here, as this seems like the ideal time to raise them and reconsider our approach.
> > > > > > > > >
> > > > > > > > > 1. Why are the new metrics being recorded at the thread level exclusively? Would there be value in exposing these metrics at additional levels (such as application), especially for operators managing large topologies?
> > > > > > > > >
> > > > > > > > > 2. Are the chosen latency metrics—average and max—sufficient for diagnosing issues in production, or should more granular statistics (e.g., percentile latencies) be considered to improve observability?
> > > > > > > > >
> > > > > > > > > Let me know your thoughts!
> > > > > > > > >
> > > > > > > > > Bests,
> > > > > > > > > Alieh
> > > > > > > > >
> > > > > > > > > On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I'd like to start a discussion on KIP-1216:
> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
> > > > > > > > > >
> > > > > > > > > > This KIP proposes adding latency metrics for each rebalance callback to provide operators with the observability needed to effectively monitor and optimize Kafka Streams applications in production environments.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Travis
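PS: since it keeps coming up as follow-on work, here is a minimal sketch of the gauge idea Matthias describes above (exposing the task counts of the last rebalance instead of a cumulative total). Again, this is purely illustrative and not part of the reverted KIP; the class and metric names are made up:

    import org.apache.kafka.common.metrics.Gauge;
    import org.apache.kafka.common.metrics.Metrics;

    // Illustrative only: expose the task counts observed in the *last*
    // rebalance as gauges, rather than lifetime totals.
    public class LastRebalanceTaskGauges {

        private volatile int lastAssigned;
        private volatile int lastRevoked;
        private volatile int lastLost;

        public LastRebalanceTaskGauges(Metrics metrics) {
            metrics.addMetric(metrics.metricName("last-rebalance-tasks-assigned", "stream-thread-metrics"),
                (Gauge<Integer>) (config, now) -> lastAssigned);
            metrics.addMetric(metrics.metricName("last-rebalance-tasks-revoked", "stream-thread-metrics"),
                (Gauge<Integer>) (config, now) -> lastRevoked);
            metrics.addMetric(metrics.metricName("last-rebalance-tasks-lost", "stream-thread-metrics"),
                (Gauge<Integer>) (config, now) -> lastLost);
        }

        // Called once at the end of each rebalance with the counts just observed.
        public void update(int assigned, int revoked, int lost) {
            lastAssigned = assigned;
            lastRevoked = revoked;
            lastLost = lost;
        }
    }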
