Hi Travis,

thanks for the KIP!

Looks good to me. I'm not sure we need the DEBUG metrics, but we can
add them. If you include active and standby tasks in the metrics,
however, I would also include warm-up tasks. Furthermore, I wasn't sure
whether Bill wanted to add the number of tasks or the actual task IDs
to the DEBUG metrics. Bill, maybe you can comment on that.
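
To make the warm-up suggestion concrete, here is a rough sketch of what
I have in mind, using the plain Kafka metrics API (the class, metric,
group, and tag names and the task collections are illustrative
placeholders, not the actual implementation):

    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.common.metrics.Gauge;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.streams.processor.TaskId;

    public class TaskCountGauges {
        // Register one gauge per task category, so warm-up tasks are
        // visible alongside active and standby tasks.
        public static void register(Metrics metrics, String threadId,
                                    Set<TaskId> activeTasks,
                                    Set<TaskId> standbyTasks,
                                    Set<TaskId> warmupTasks) {
            Map<String, String> tags = Map.of("thread-id", threadId);
            metrics.addMetric(
                metrics.metricName("active-task-count", "stream-thread-metrics", tags),
                (Gauge<Integer>) (config, now) -> activeTasks.size());
            metrics.addMetric(
                metrics.metricName("standby-task-count", "stream-thread-metrics", tags),
                (Gauge<Integer>) (config, now) -> standbyTasks.size());
            metrics.addMetric(
                metrics.metricName("warmup-task-count", "stream-thread-metrics", tags),
                (Gauge<Integer>) (config, now) -> warmupTasks.size());
        }
    }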

I think after hashing out these finer points about the DEBUG metrics,
we can open a vote thread.

Cheers,
Lucas

On Mon, Sep 15, 2025 at 6:38 AM Travis Zhang
<tzh...@confluent.io.invalid> wrote:
>
> Hi Bill,
>
> Thanks for your feedback. It does make sense to me. I've added the
> total task count metrics to the KIP at DEBUG level!
>
> Best,
> Travis
>
> On Fri, Sep 12, 2025 at 11:44 AM Bill Bejeck <bbej...@gmail.com> wrote:
> >
> > Hi Travis,
> >
> > Thanks for the KIP! It looks like a useful addition in support of KIP-1017.
> > Overall the KIP LGTM, but I have a follow-up question.
> >
> > Would we want to consider an additional metric displaying the tasks
> > involved in each of the revoked, assigned, and lost events?
> > This would probably be best at the DEBUG level.
> > This is certainly an optional suggestion, but I do feel it would be a
> > valuable aid to operators of KS applications.
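> >
> > Purely as an illustration of the task-ID variant (the metric name and
> > the assignedTasks collection are made up, not a proposal for the final
> > design, and metrics/tags are assumed to be in scope as above):
> >
> >     // Hypothetical sketch: expose the task IDs from the most recent
> >     // assignment as a comma-separated string at DEBUG level.
> >     metrics.addMetric(
> >         metrics.metricName("last-assigned-tasks", "stream-thread-metrics", tags),
> >         (Gauge<String>) (config, now) ->
> >             assignedTasks.stream()
> >                 .map(TaskId::toString)
> >                 .sorted()
> >                 .collect(java.util.stream.Collectors.joining(",")));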
> >
> > Regards,
> > Bill
> >
> > On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <tzh...@confluent.io.invalid>
> > wrote:
> >
> > > Hey Alieh,
> > >
> > > Thanks for the great questions and the thoughtful feedback on the KIP!
> > >
> > > Good call on adding the code snippets—I'll get the key class
> > > structures into the KIP to make it fully self-contained.
> > >
> > > You raised some excellent points on the metrics strategy. Here’s my
> > > thinking on them:
> > >
> > > 1. Why Thread-Level Metrics:
> > >
> > > We opted for thread-level reporting for two main reasons:
> > > debuggability and consistency. When a rebalance gets stuck, operators
> > > need to pinpoint exactly which StreamThread is the bottleneck, as each
> > > one can have a very different workload. This approach also aligns with
> > > all other core metrics (like process-latency), which are already
> > > scoped to the thread.
> > >
> > > While it is possible to add application-level aggregates, they
> > > wouldn't offer new insight, since any application-wide issue will
> > > always show up in one or more threads. I feel this approach gives
> > > operators the most diagnostic power without adding noise.
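> > >
> > > To make the scoping concrete (the metric name here is a placeholder,
> > > not the final one from the KIP), each new metric would carry the
> > > thread id as a tag, exactly like the existing thread-level metrics
> > > such as process-latency:
> > >
> > >     // Hypothetical name; the point is the thread-id tag, which lets
> > >     // operators pinpoint the slow StreamThread directly.
> > >     Map<String, String> tags =
> > >         Map.of("thread-id", Thread.currentThread().getName());
> > >     MetricName name = metrics.metricName(
> > >         "onpartitionsassigned-latency-avg", "stream-thread-metrics", tags);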
> > >
> > > 2. Avg/Max vs. Percentiles:
> > >
> > > On using avg/max, I think it is good enough for now, mainly because
> > > of the nature of rebalances: they're infrequent but high-impact
> > > events. Unlike a constant stream of processing operations, a single
> > > slow rebalance is itself the production issue, making max latency the
> > > most critical signal for an operator.
> > >
> > > Percentiles are less statistically meaningful for such low-frequency
> > > events and introduce a memory overhead we'd like to avoid initially.
> > >
> > > We can definitely consider adding percentiles in a future KIP if we
> > > find avg/max isn't sufficient once this is in production.
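> > >
> > > For illustration (sensor and metric names are placeholders; Sensor is
> > > from org.apache.kafka.common.metrics, Avg and Max from
> > > org.apache.kafka.common.metrics.stats), avg and max come almost for
> > > free on one sensor, while percentiles would need a pre-sized
> > > histogram per sensor:
> > >
> > >     Sensor sensor = metrics.sensor("rebalance-listener-latency");
> > >     sensor.add(metrics.metricName("rebalance-latency-avg",
> > >         "stream-thread-metrics", tags), new Avg());
> > >     sensor.add(metrics.metricName("rebalance-latency-max",
> > >         "stream-thread-metrics", tags), new Max());
> > >
> > >     // In the rebalance callback, record the observed latency:
> > >     sensor.record(elapsedMs);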
> > >
> > > Let me know if this reasoning makes sense. Happy to discuss it more!
> > >
> > > Best,
> > > Travis
> > >
> > >
> > > On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi
> > > <asae...@confluent.io.invalid> wrote:
> > > >
> > > > Hey Travis
> > > >
> > > > Thanks for sharing the KIP.
> > > >
> > > > One suggestion (not essential): would it be possible to include the
> > > > relevant code snippet and the new class directly in the KIP in the
> > > > `Proposed Changes` section? That way, everything is self-contained and
> > > > there’s no need to switch between the KIP and the codebase.
> > > > I understand that you’re incorporating the existing metrics from the
> > > > old protocol into the new one, with the goal of maintaining
> > > > consistency in the metrics provided. However, I still have a few
> > > > questions that might be best addressed here, as this seems like the
> > > > ideal time to raise them and reconsider our approach.
> > > > 1. Why are the new metrics being recorded at the thread level
> > > > exclusively? Would there be value in exposing these metrics at
> > > > additional levels (such as application), especially for operators
> > > > managing large topologies?
> > > >
> > > > 2. Are the chosen latency metrics (average and max) sufficient for
> > > > diagnosing issues in production, or should more granular statistics
> > > > (e.g., percentile latencies) be considered to improve observability?
> > > >
> > > > Let me know your thoughts!
> > > >
> > > >
> > > > Bests,
> > > > Alieh
> > > >
> > > > On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang
> > > > <tzh...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to start a discussion on KIP-1216:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
> > > > >
> > > > > This KIP proposes adding latency metrics for each rebalance callback
> > > > > to provide operators with the observability needed to effectively
> > > > > monitor and optimize Kafka Streams applications in production
> > > > > environments.
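> > > > >
> > > > > As a rough sketch of the mechanics (the delegate and sensor names
> > > > > are illustrative, not the KIP's final design; Collection and
> > > > > TopicPartition are the usual consumer API types), each
> > > > > ConsumerRebalanceListener callback would be timed and recorded:
> > > > >
> > > > >     @Override
> > > > >     public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
> > > > >         final long startNs = System.nanoTime();
> > > > >         // Delegate to the wrapped user/streams listener.
> > > > >         delegate.onPartitionsAssigned(partitions);
> > > > >         final double elapsedMs =
> > > > >             (System.nanoTime() - startNs) / 1_000_000.0;
> > > > >         assignedLatencySensor.record(elapsedMs);
> > > > >     }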
> > > > >
> > > > > Thanks,
> > > > > Travis
> > > > >
> > >
