Hi Travis, thanks for the KIP!

Looks good to me. I'm not sure we need the DEBUG metrics, but we can add
them. I would, however, also include warm-up tasks in the metrics, since
you are already including active and standby tasks. Furthermore, I wasn't
sure whether Bill wanted to add the number of tasks or the actual task
IDs to the DEBUG metrics. Bill, maybe you can comment on that.
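
Just to make the two options concrete, here is a minimal sketch of what
the DEBUG-level gauges could look like (metric names and the surrounding
plumbing are hypothetical, not taken from the KIP):

    import java.util.Collection;
    import java.util.Map;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Gauge;
    import org.apache.kafka.common.metrics.Metrics;

    public final class RebalanceTaskGaugesSketch {

        // Option A: expose only how many tasks the last rebalance assigned.
        public static void registerTaskCount(Metrics metrics,
                                             Map<String, String> threadTags,
                                             Collection<String> assignedTaskIds) {
            MetricName name = metrics.metricName(
                "last-rebalance-assigned-task-count",  // hypothetical name
                "stream-thread-metrics",
                "Number of tasks assigned in the last rebalance",
                threadTags);
            metrics.addMetric(name,
                (Gauge<Integer>) (config, now) -> assignedTaskIds.size());
        }

        // Option B: expose the actual task IDs, e.g. "0_0,0_1,1_0".
        public static void registerTaskIds(Metrics metrics,
                                           Map<String, String> threadTags,
                                           Collection<String> assignedTaskIds) {
            MetricName name = metrics.metricName(
                "last-rebalance-assigned-task-ids",    // hypothetical name
                "stream-thread-metrics",
                "Task IDs assigned in the last rebalance",
                threadTags);
            metrics.addMetric(name,
                (Gauge<String>) (config, now) -> String.join(",", assignedTaskIds));
        }
    }

Option B is the more useful one when chasing a specific stuck task, at the
cost of an unbounded string value, which is probably acceptable at DEBUG;
either way it stays cheap, because gauges are only evaluated when read.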

I think after hashing out these finer points about the DEBUG metrics, we
can open a vote thread.

Cheers,
Lucas

On Mon, Sep 15, 2025 at 6:38 AM Travis Zhang <tzh...@confluent.io.invalid> wrote:
>
> Hi Bill,
>
> Thanks for your feedback. It does make sense to me. I've added the
> total task count metrics to the KIP at DEBUG level!
>
> Best,
> Travis
>
> On Fri, Sep 12, 2025 at 11:44 AM Bill Bejeck <bbej...@gmail.com> wrote:
> >
> > Hi Travis,
> >
> > Thanks for the KIP! It looks like a useful addition in support of
> > KIP-1017. Overall the KIP LGTM, but I have a follow-up question.
> >
> > Would we want to consider an additional metric displaying the tasks
> > involved in each of the revoked, assigned, and lost events? This
> > would probably be best at the DEBUG level. Certainly this is an
> > optional suggestion, but I do feel it would be a valuable aid to
> > operators of KS applications.
> >
> > Regards,
> > Bill
> >
> > On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <tzh...@confluent.io.invalid> wrote:
> >
> > > Hey Alieh,
> > >
> > > Thanks for the great questions and the thoughtful feedback on the KIP!
> > >
> > > Good call on adding the code snippets—I'll get the key class
> > > structures into the KIP to make it fully self-contained.
> > >
> > > You raised some excellent points on the metrics strategy. Here’s my
> > > thinking on them:
> > >
> > > 1. Why thread-level metrics:
> > >
> > > We opted for thread-level reporting for two main reasons:
> > > debuggability and consistency. When a rebalance gets stuck,
> > > operators need to pinpoint exactly which StreamThread is the
> > > bottleneck, as each one can have a very different workload. This
> > > approach also aligns with all other core metrics (like
> > > process-latency), which are already scoped to the thread.
> > >
> > > While it is possible to add application-level aggregates, they
> > > wouldn't offer new insights, since any application-wide issue will
> > > always show up in one or more threads. I felt this gives operators
> > > the most diagnostic power without adding noise.
> > >
> > > 2. Avg/max vs. percentiles:
> > >
> > > I think avg/max is good for now, mainly because of the nature of
> > > rebalances: they're infrequent but high-impact events. Unlike a
> > > constant stream of processing operations, a single slow rebalance
> > > is the production issue, making max latency the most critical
> > > signal for an operator.
> > >
> > > Percentiles are less statistically meaningful for such
> > > low-frequency events and introduce a memory overhead we'd like to
> > > avoid initially. We can definitely consider adding percentiles in a
> > > future KIP if we find avg/max isn't sufficient once this is in
> > > production.
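> > >
> > > To make that concrete, the per-thread avg/max registration is
> > > conceptually along these lines (simplified sketch with hypothetical
> > > metric names; see the KIP for the real ones):
> > >
> > >     import java.util.Map;
> > >     import org.apache.kafka.common.metrics.Metrics;
> > >     import org.apache.kafka.common.metrics.Sensor;
> > >     import org.apache.kafka.common.metrics.stats.Avg;
> > >     import org.apache.kafka.common.metrics.stats.Max;
> > >
> > >     public final class CallbackLatencySensorSketch {
> > >         // One sensor per thread per callback, tagged with the
> > >         // thread id so operators can pinpoint the slow StreamThread.
> > >         public static Sensor create(Metrics metrics, String threadId,
> > >                                     String callback) {
> > >             Map<String, String> tags = Map.of("thread-id", threadId);
> > >             Sensor sensor = metrics.sensor(threadId + "-" + callback);
> > >             sensor.add(metrics.metricName(callback + "-latency-avg",
> > >                 "stream-thread-metrics", "Average callback latency",
> > >                 tags), new Avg());
> > >             sensor.add(metrics.metricName(callback + "-latency-max",
> > >                 "stream-thread-metrics", "Maximum callback latency",
> > >                 tags), new Max());
> > >             return sensor;  // each rebalance: sensor.record(latencyMs)
> > >         }
> > >     }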
> > >
> > > Let me know if this reasoning makes sense. Happy to discuss it more!
> > >
> > > Best,
> > > Travis
> > >
> > > On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi
> > > <asae...@confluent.io.invalid> wrote:
> > > >
> > > > Hey Travis
> > > >
> > > > Thanks for sharing the KIP.
> > > >
> > > > One suggestion (not essential): would it be possible to include
> > > > the relevant code snippet and the new class directly in the
> > > > `Proposed Changes` section of the KIP? That way, everything is
> > > > self-contained and there’s no need to switch between the KIP and
> > > > the codebase.
> > > >
> > > > I understand that you’re incorporating the existing metrics from
> > > > the old protocol into the new one, with the goal of maintaining
> > > > consistency in the metrics provided. However, I still have a few
> > > > questions that might be best addressed here, as this seems like
> > > > the ideal time to raise them and reconsider our approach.
> > > >
> > > > 1. Why are the new metrics being recorded at the thread level
> > > > exclusively? Would there be value in exposing these metrics at
> > > > additional levels (such as application), especially for operators
> > > > managing large topologies?
> > > >
> > > > 2. Are the chosen latency metrics—average and max—sufficient for
> > > > diagnosing issues in production, or should more granular
> > > > statistics (e.g., percentile latencies) be considered to improve
> > > > observability?
> > > >
> > > > Let me know your thoughts!
> > > >
> > > > Bests,
> > > > Alieh
> > > >
> > > > On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang
> > > > <tzh...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to start a discussion on KIP-1216:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
> > > > >
> > > > > This KIP proposes adding latency metrics for each rebalance
> > > > > callback, to provide operators with the observability needed to
> > > > > effectively monitor and optimize Kafka Streams applications in
> > > > > production environments.
> > > > >
> > > > > Thanks,
> > > > > Travis
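
PS for anyone skimming the thread: conceptually, the KIP boils down to
timing the three rebalance callbacks per StreamThread, roughly like the
following (my simplified sketch, not the actual patch; the Sensor
arguments would be the avg/max sensors discussed above):

    import java.util.Collection;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.utils.Time;

    // Wraps the existing rebalance listener and records how long each
    // callback takes into per-thread avg/max sensors.
    public final class TimedRebalanceListener implements ConsumerRebalanceListener {
        private final ConsumerRebalanceListener delegate;
        private final Sensor assigned;
        private final Sensor revoked;
        private final Sensor lost;
        private final Time time = Time.SYSTEM;

        public TimedRebalanceListener(ConsumerRebalanceListener delegate,
                                      Sensor assigned, Sensor revoked,
                                      Sensor lost) {
            this.delegate = delegate;
            this.assigned = assigned;
            this.revoked = revoked;
            this.lost = lost;
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            long start = time.milliseconds();
            delegate.onPartitionsAssigned(partitions);
            assigned.record(time.milliseconds() - start);
        }

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            long start = time.milliseconds();
            delegate.onPartitionsRevoked(partitions);
            revoked.record(time.milliseconds() - start);
        }

        @Override
        public void onPartitionsLost(Collection<TopicPartition> partitions) {
            long start = time.milliseconds();
            delegate.onPartitionsLost(partitions);
            lost.record(time.milliseconds() - start);
        }
    }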