Thanks for all the very helpful discussions. I'm closing the vote with the following tally:
+1: 7 (Nick, John, Walker, Bruno, Lucas, Matthias, Guozhang), with 5 binding votes and 2 non-binding votes.
-1: 0

Guozhang

On Wed, Jan 25, 2023 at 5:48 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> Thanks!
>
> +1 (binding)
>
> -Matthias
>
> On 1/24/23 1:17 PM, Guozhang Wang wrote:
> > Hi Matthias:
> >
> > re "paused" -> "suspended": I got your point now, thanks. Just to
> > clarify, the two functions are a bit different: "paused" tasks are
> > because of the topology being paused, i.e. from KIP-834; whereas
> > "suspended" tasks are restoring tasks that are removed before they
> > complete due to a follow-up rebalance, and this is to distinguish
> > them from "onRestoreEnd", as described in KAFKA-10575. A suspended
> > task is no longer owned by the thread and hence there's no need to
> > measure the number of such tasks.
> >
> > re: "restore-ratio": that's a good point. I like it to function in the
> > same way as the "records-rate" metrics. Will update the wiki.
> >
> > re: making "restore-remaining-records-total" at INFO level: sounds
> > good to me too. I will also update the metric name a bit to be more
> > specific.
> >
> >
> >
> > On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
> > <guozhang.wang...@gmail.com> wrote:
> >>
> >> Hello Matthias,
> >>
> >> Thanks for the feedback. I was on vacation for a while. Pardon the
> >> late replies. Please see them inline below
> >>
> >> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >>>
> >>> Seems I am late to the party... Great KIP. Couple of questions from my
> >>> side:
> >>>
> >>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> >>> same as the number of assigned standby tasks? Not sure how useful it
> >>> would be?
> >>>
> >> In general, yes, it is the number of assigned standby tasks --- there
> >> will be transient periods when the assigned standby tasks are not yet
> >> being updated but they would not last long --- but we did not have a
> >> direct gauge to expose this before, and users have to infer this from
> >> other indirect metrics.
> >>
> >>>
> >>>
> >>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> >>> exactly mean? There was a discussion about renaming the callback method
> >>> from pause to suspended. So should this be called `suspended`, too? And
> >>> if yes, how is it useful for users?
> >>>
> >> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> >> Topologies"
> >> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> >> When a topology is paused, all its tasks including standbys will be
> >> paused too.
> >>
> >> I'm not aware of a discussion to rename the call name to "suspend" for
> >> KIP-834. Could you point me to the reference?
> >>
> >>>
> >>>
> >>> (3) `restore-ratio`: the description says
> >>>
> >>>> The fraction of time the thread spent on restoring active or standby
> >>>> tasks
> >>>
> >>> I find the term "restoring" applies only to active tasks, but not to
> >>> standbys. Can we reword this?
> >>>
> >> Yeah I have been discussing this with others in the community a bit as
> >> well, but so far I have not been convinced of a better name than it.
> >> Some other alternatives that were discussed but did not win everyone
> >> over are "restore-or-update-ratio", "process-ratio" (for the restore
> >> thread that means restoring or updating), and "io-ratio".
> >>
> >> The only one so far that I feel is probably better is
> >> "state-update-ratio". If folks feel this one is better than
> >> "restore-ratio" I'm happy to update.
> >>
> >>>
> >>> (4) `restore-call-rate`: not sure what you exactly mean by "restore
> >>> calls"?
> >>>
> >> This is similar to the "io-calls-rate" in the selector classes, i.e.
> >> the number of "restore" function calls made. It's arguably a very
> >> low-level metric but I included it since it could be useful in some
> >> debugging scenarios.
> >>
> >>>
> >>> (5) `restore-remaining-records-total` -- why is this a task metric?
> >>> Seems we could roll it up into a thread metric that we report at INFO
> >>> level (we could still have per-task DEBUG level metric for it in
> >>> addition).
> >>>
> >> The rationale behind it is the general principle in metrics design
> >> that "Kafka would provide the lowest necessary metrics levels, and
> >> users can do the roll-ups however they want".
> >>
> >>>
> >>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> >>> but it seems it's hard for users to reason about it in the scale-out
> >>> warm-up case. Would it be helpful (and possible) to report "warmup
> >>> progress" explicitly?
> >>>
> >> At the restore thread level, we cannot differentiate standby tasks
> >> from warmup tasks since the latter are created exactly like the
> >> former. But I do agree this is a visibility issue that is worth
> >> addressing; I think another KIP would be needed to first consider
> >> distinguishing these two at the class level.
> >>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> >>>> We need this!
> >>>>
> >>>> + 1 non binding
> >>>>
> >>>> Cheers,
> >>>> Lucas
> >>>>
> >>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <cado...@apache.org> wrote:
> >>>>>
> >>>>> Guozhang,
> >>>>>
> >>>>> Thanks for the KIP!
> >>>>>
> >>>>> +1 (binding)
> >>>>>
> >>>>> Best,
> >>>>> Bruno
> >>>>>
> >>>>> On 25.10.22 22:07, Walker Carlson wrote:
> >>>>>> +1 non binding
> >>>>>>
> >>>>>> Thanks for the kip!
> >>>>>>
> >>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vvcep...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks for the KIP, Guozhang!
> >>>>>>>
> >>>>>>> I'm +1 (binding)
> >>>>>>>
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>>>>> Can't wait!
> >>>>>>>> +1 (non-binding)
> >>>>>>>>
> >>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang,
> >>>>>>>> <guozhang.wang...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>
> >>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve
> >>>>>>>>> Kafka Streams' restoration visibility via new metrics and callback
> >>>>>>>>> methods:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>
> >>>>>>
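
For readers following point (5) in the quoted discussion: the "users can do the roll-ups however they want" principle could look roughly like the sketch below on the monitoring side. It is only an illustration under stated assumptions: the samples are made-up numbers, the `thread-id` and `task-id` tag names are assumed, and `roll_up_by_thread` is a hypothetical helper, not part of Kafka or of the KIP. It sums a per-task `restore-remaining-records-total` style value into one total per thread.

```python
from collections import defaultdict


def roll_up_by_thread(samples):
    """Sum a per-task metric into one total per thread.

    `samples` is an iterable of (tags, value) pairs, where `tags` is a dict
    with at least a "thread-id" entry (an assumed tag name).
    """
    totals = defaultdict(int)
    for tags, value in samples:
        totals[tags["thread-id"]] += value
    return dict(totals)


# Hypothetical per-task samples; the values are invented for illustration.
samples = [
    ({"thread-id": "StreamThread-1", "task-id": "0_0"}, 1200),
    ({"thread-id": "StreamThread-1", "task-id": "0_1"}, 300),
    ({"thread-id": "StreamThread-2", "task-id": "1_0"}, 0),
]

per_thread = roll_up_by_thread(samples)
for thread_id, remaining in sorted(per_thread.items()):
    print(f"{thread_id}: {remaining} records remaining")
```

The same grouping works for any other tag (for example, per client instead of per thread) by changing the key used for the totals.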