Hello Matthias,

Thanks for the feedback. I was on vacation for a while, so apologies for the late replies. Please see my responses inline below.

On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> Seems I am late to the party... Great KIP. Couple of questions from my side:
>
> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> same as the number of assigned standby task? Not sure how useful it
> would be?
>

In general, yes, it is the number of assigned standby tasks. There are transient periods when assigned standby tasks are not yet being updated, though these do not last long. However, we have not had a direct gauge that exposes this so far, and users have had to infer it from other, indirect metrics.

>
>
> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> exactly mean? There was a discussion about renaming the callback method
> from pause to suspended. So should this be called `suspended`, too? And
> if yes, how is it useful for users?
>

Pausing here refers to "KIP-834: Pause / Resume KafkaStreams Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). When a topology is paused, all of its tasks, including standbys, are paused too. I'm not aware of a discussion about renaming the call to "suspend" for KIP-834. Could you point me to the reference?

>
>
> (3) `restore-ratio`: the description says
>
> > The fraction of time the thread spent on restoring active or standby tasks
>
> I find the term "restoring" does only apply to active tasks, but not to
> standbys. Can we reword this?
>

Yeah, I have been discussing this with others in the community a bit as well, but so far I have not been convinced that any alternative is a better name. Other candidates that have been discussed, but have not won everyone over, are "restore-or-update-ratio", "process-ratio" (for the restore thread, that means restoring or updating), and "io-ratio". The only one so far that I feel is probably better is "state-update-ratio". If folks feel this one is better than "restore-ratio", I'm happy to update.

>
> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>

This is similar to the "io-calls-rate" in the selector classes, i.e. the number of "restore" function calls made. It's arguably a very low-level metric, but I included it since it could be useful in some debugging scenarios.

>
> (5) `restore-remaining-records-total` -- why is this a task metric?
> Seems we could roll it up into a thread metric that we report at INFO
> level (we could still have per-task DEBUG level metric for it in addition).
>

The rationale is the general principle in metrics design that "Kafka would provide the lowest necessary metrics levels, and users can do the roll-ups however they want".

>
> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> but it seems it's hard for users to reason about it in the scale-out
> warm-up case. Would it be helpful (and possible) to report "warmup
> progress" explicitly?
>

At the restore thread level, we cannot differentiate standby tasks from warmup tasks, since the latter are created exactly like the former. But I do agree this is a visibility issue that is worth addressing. I think another KIP would be needed to first consider distinguishing these two at the class level.

>
> -Matthias
>
>
> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> > We need this!
> >
> > + 1 non binding
> >
> > Cheers,
> > Lucas
> >
> > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <cado...@apache.org> wrote:
> >>
> >> Guozhang,
> >>
> >> Thanks for the KIP!
> >>
> >> +1 (binding)
> >>
> >> Best,
> >> Bruno
> >>
> >> On 25.10.22 22:07, Walker Carlson wrote:
> >>> +1 non binding
> >>>
> >>> Thanks for the kip!
> >>>
> >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vvcep...@apache.org> wrote:
> >>>
> >>>> Thanks for the KIP, Guozhang!
> >>>>
> >>>> I'm +1 (binding)
> >>>>
> >>>> -John
> >>>>
> >>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>> Can't wait!
> >>>>> +1 (non-binding)
> >>>>>
> >>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <guozhang.wang...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello all,
> >>>>>>
> >>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >>>>>> Stream's restoration visibility via new metrics and callback methods:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>
> >>>>>>
> >>>>>> Thanks!
> >>>>>> -- Guozhang
> >>>>>>
> >>>>
> >>>
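
P.S. A user-side roll-up of the kind mentioned under (5) could be sketched roughly as follows. This is only an illustration, not part of the KIP: it assumes the per-task `restore-remaining-records-total` values have already been read from the application's metrics and copied into a hypothetical `TaskMetric` holder that carries the owning thread id. `TaskMetric` and `RestoreRollup` are made-up names, not Kafka classes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical holder for one per-task metric sample; NOT a Kafka class.
final class TaskMetric {
    final String threadId;          // thread that owns the task (assumed to be known)
    final String taskId;            // task identifier, e.g. "0_1"
    final double remainingRecords;  // sampled restore-remaining-records-total value

    TaskMetric(String threadId, String taskId, double remainingRecords) {
        this.threadId = threadId;
        this.taskId = taskId;
        this.remainingRecords = remainingRecords;
    }
}

// Hypothetical helper that rolls task-level values up to thread level.
final class RestoreRollup {
    // Sums the per-task remaining-record counts into one total per thread id.
    static Map<String, Double> remainingByThread(List<TaskMetric> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (TaskMetric m : samples) {
            totals.merge(m.threadId, m.remainingRecords, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<TaskMetric> samples = List.of(
            new TaskMetric("StreamThread-1", "0_0", 1200.0),
            new TaskMetric("StreamThread-1", "0_1", 300.0),
            new TaskMetric("StreamThread-2", "1_0", 50.0));
        System.out.println(remainingByThread(samples));
    }
}
```

Keeping the metric at task level leaves the grouping key to the user: the same samples could just as well be summed per sub-topology or per instance instead of per thread.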