Hi Matthias: re "paused" -> "suspended": I got your point now, thanks. Just to clarify the two functions are a bit different: "paused" tasks are because of the topology being paused, i.e. from KIP-834; whereas "suspended" tasks are when a restoring tasks are being removed before it completes due to a follow-up rebalance, and this is to distinguish with "onRestoreEnd", as described in KAFKA-10575. A suspended task is no longer owned by the thread and hence there's no need to measure the number of such tasks.
re: "restore-ratio": that's a good point. I like it to function in the same way as the "records-rate" metrics. Will update the wiki. re: making "restore-remaining-records-total" at INFO level: sounds good to me too. I will also update the metric name a bit to be more specific. On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang <guozhang.wang...@gmail.com> wrote: > > Hello Matthias, > > Thanks for the feedback. I was on vacation for a while. Pardon for the > late replies. Please see them inline below > > On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote: > > > > Seems I am late to the party... Great KIP. Couple of questions from my side: > > > > (1) What is the purpose of `standby-updating-tasks`? It seems to be the > > same as the number of assigned standby task? Not sure how useful it > > would be? > > > In general, yes, it is the number of assigned standby tasks --- there > will be transit times when the assigned standby tasks are not yet > being updated but it would not last long --- but we do not yet have a > direct gauge to expose this before, and users have to infer this from > other indirect metrics. > > > > > > > (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" > > exactly mean? There was a discussion about renaming the callback method > > from pause to suspended. So should this be called `suspended`, too? And > > if yes, how is it useful for users? > > > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams > Topologies" > (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). > When a topology is paused, all its tasks including standbys will be > paused too. > > I'm not aware of a discussion to rename the call name to "suspend" for > KIP-834. Could you point me to the reference? > > > > > > > (3) `restore-ratio`: the description says > > > > > The fraction of time the thread spent on restoring active or standby tasks > > > > I find the term "restoring" does only apply to active tasks, but not to > > standbys. Can we reword this? > > > Yeah I have been discussing this with others in the community a bit as > well, but so far I have not been convinced of a better name than it. > Some other alternatives being discussed but not win everyone's love is > "restore-or-update-ratio", "process-ratio" (for the restore thread > that means restoring or updating), and "io-ratio". > > The only one so far that I feel is probably better, is > "state-update-ratio". If folks feel this one is better than > "restore-ratio" I'm happy to update. > > > > > (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"? > > > This is similar to the "io-calls-rate" in the selector classes, i.e. > the number of "restore" function calls made. It's argurably a very > low-level metrics but I included it since it could be useful in some > debugging scenarios. > > > > > (5) `restore-remaining-records-total` -- why is this a task metric? > > Seems we could roll it up into a thread metric that we report at INFO > > level (we could still have per-task DEBUG level metric for it in addition). > > > The rationale behind it is the general principle in metrics design > that "Kafka would provide the lowest necessary metrics levels, and > users can do the roll-ups however they want". > > > > > (6) What about "warmup tasks"? Internally, we treat them as standbys, > > but it seems it's hard for users to reason about it in the scale-out > > warm-up case. Would it be helpful (and possible) to report "warmup > > progress" explicitly? > > > At the restore thread level, we cannot differentiate standby tasks > from warmup tasks since the latter is created exactly just like the > former. But I do agree this is an issue for visibility that worth > addressing, I think another KIP would be needed to first consider > distinguishing these two at the class level. > > > > > -Matthias > > > > > > On 11/1/22 2:44 AM, Lucas Brutschy wrote: > > > We need this! > > > > > > + 1 non binding > > > > > > Cheers, > > > Lucas > > > > > > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <cado...@apache.org> wrote: > > >> > > >> Guozhang, > > >> > > >> Thanks for the KIP! > > >> > > >> +1 (binding) > > >> > > >> Best, > > >> Bruno > > >> > > >> On 25.10.22 22:07, Walker Carlson wrote: > > >>> +1 non binding > > >>> > > >>> Thanks for the kip! > > >>> > > >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vvcep...@apache.org> > > >>> wrote: > > >>> > > >>>> Thanks for the KIP, Guozhang! > > >>>> > > >>>> I'm +1 (binding) > > >>>> > > >>>> -John > > >>>> > > >>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: > > >>>>> Can't wait! > > >>>>> +1 (non-binding) > > >>>>> > > >>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <guozhang.wang...@gmail.com> > > >>>>> wrote: > > >>>>> > > >>>>>> Hello all, > > >>>>>> > > >>>>>> I'd like to start a vote for the following KIP, aiming to improve > > >>>>>> Kafka > > >>>>>> Stream's restoration visibility via new metrics and callback methods: > > >>>>>> > > >>>>>> > > >>>>>> > > >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > > >>>>>> > > >>>>>> > > >>>>>> Thanks! > > >>>>>> -- Guozhang > > >>>>>> > > >>>> > > >>>