Hi Travis,

Thanks for the KIP! It looks like a useful addition in support of
KIP-1017. Overall the KIP LGTM, but I have a follow-up question.
Would we want to consider an additional metric displaying the tasks
involved in each of the revoked, assigned, and lost events? This would
probably be best at the DEBUG level. Certainly this is an optional
suggestion, but I do feel it would be a valuable aid to operators of KS
applications.

Regards,
Bill

On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <tzh...@confluent.io.invalid> wrote:

> Hey Alieh,
>
> Thanks for the great questions and the thoughtful feedback on the KIP!
>
> Good call on adding the code snippets. I'll get the key class
> structures into the KIP to make it fully self-contained.
>
> You raised some excellent points on the metrics strategy. Here's my
> thinking on them:
>
> 1. Why thread-level metrics:
>
> We opted for thread-level reporting for two main reasons:
> debuggability and consistency. When a rebalance gets stuck, operators
> need to pinpoint exactly which StreamThread is the bottleneck, as each
> one can have a very different workload. This approach also aligns with
> the other core metrics (like process-latency), which are already
> scoped to the thread.
>
> While it is possible to add application-level aggregates, they
> wouldn't offer new insight, since any application-wide issue will
> always show up in one or more threads. I feel this gives operators the
> most diagnostic power without adding noise.
>
> 2. Avg/max vs. percentiles:
>
> I think avg/max is sufficient for now, mainly because of the nature of
> rebalances: they're infrequent but high-impact events. Unlike a
> constant stream of processing operations, a single slow rebalance is
> itself the production issue, making max latency the most critical
> signal for an operator.
>
> Percentiles are less statistically meaningful for such low-frequency
> events and introduce a memory overhead we'd like to avoid initially.
>
> We can definitely consider adding percentiles in a future KIP if we
> find avg/max isn't sufficient once this is in production.
>
> Let me know if this reasoning makes sense. Happy to discuss it more!
>
> Best,
> Travis
>
>
> On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi
> <asae...@confluent.io.invalid> wrote:
> >
> > Hey Travis,
> >
> > Thanks for sharing the KIP.
> >
> > One suggestion (not essential): would it be possible to include the
> > relevant code snippets and the new class directly in the KIP in the
> > `Proposed Changes` section? That way, everything is self-contained
> > and there's no need to switch between the KIP and the codebase.
> >
> > I understand that you're incorporating the existing metrics from the
> > old protocol into the new one, with the goal of maintaining
> > consistency in the metrics provided. However, I still have a few
> > questions that might be best addressed here, as this seems like the
> > ideal time to raise them and reconsider our approach.
> >
> > 1. Why are the new metrics being recorded at the thread level
> > exclusively? Would there be value in exposing these metrics at
> > additional levels (such as application), especially for operators
> > managing large topologies?
> >
> > 2. Are the chosen latency metrics (average and max) sufficient for
> > diagnosing issues in production, or should more granular statistics
> > (e.g., percentile latencies) be considered to improve observability?
> >
> > Let me know your thoughts!
> >
> > Bests,
> > Alieh
> >
> > On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang
> > <tzh...@confluent.io.invalid> wrote:
> >
> > > Hi,
> > >
> > > I'd like to start a discussion on KIP-1216:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
> > >
> > > This KIP proposes adding latency metrics for each rebalance callback
> > > to provide operators with the observability needed to effectively
> > > monitor and optimize Kafka Streams applications in production
> > > environments.
> > >
> > > Thanks,
> > > Travis
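---

A minimal sketch of the thread-level point discussed above: per-thread
values can be read off the existing KafkaStreams#metrics() API. The
metric name "onpartitionsrevoked-latency-max" below is a hypothetical
placeholder, since the KIP defines the actual names; the
"stream-thread-metrics" group and "thread-id" tag follow the layout of
the existing thread-level metrics.

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class RebalanceMetricsProbe {

    // Print the max rebalance-callback latency reported by each StreamThread.
    // NOTE: "onpartitionsrevoked-latency-max" is a hypothetical metric name;
    // the group/tag layout matches the existing thread-level metrics.
    public static void printRevokedLatency(final KafkaStreams streams) {
        for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            final MetricName name = entry.getKey();
            if ("stream-thread-metrics".equals(name.group())
                    && "onpartitionsrevoked-latency-max".equals(name.name())) {
                System.out.printf("thread=%s %s=%s%n",
                        name.tags().get("thread-id"),
                        name.name(),
                        entry.getValue().metricValue());
            }
        }
    }
}

Because every value carries a thread-id tag, an application-level view
can still be derived downstream (e.g., a max over threads in the
monitoring system), consistent with the reasoning in the thread that
aggregates would not add new information.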
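On the avg/max vs. percentiles trade-off, the cost difference is
visible in Kafka's own metrics library: Avg and Max are single running
values, while Percentiles must be given a pre-sized histogram buffer
and an assumed upper bound. A self-contained sketch follows; the sensor
and group names are illustrative, not taken from the KIP.

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;
import org.apache.kafka.common.metrics.stats.Percentile;
import org.apache.kafka.common.metrics.stats.Percentiles;
import org.apache.kafka.common.metrics.stats.Percentiles.BucketSizing;

public class LatencySensors {
    public static void main(final String[] args) {
        final Metrics metrics = new Metrics();

        // Avg/Max: constant memory, one running value per stat.
        final Sensor avgMax = metrics.sensor("rebalance-latency");
        avgMax.add(metrics.metricName("rebalance-latency-avg", "example-group"), new Avg());
        avgMax.add(metrics.metricName("rebalance-latency-max", "example-group"), new Max());

        // Percentiles: requires a pre-sized histogram (4 KB here) plus an
        // assumed upper bound (60 s), which is the memory overhead noted in
        // the thread, and a poor fit for events as infrequent as rebalances.
        final Sensor pct = metrics.sensor("rebalance-latency-pct");
        pct.add(new Percentiles(4096, 60_000d, BucketSizing.LINEAR,
                new Percentile(metrics.metricName("rebalance-latency-p99", "example-group"), 99)));

        avgMax.record(123.0); // a callback latency, in milliseconds
        pct.record(123.0);
        metrics.close();
    }
}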