Hi Matthias:

re "paused" -> "suspended": I got your point now, thanks. Just to
clarify the two functions are a bit different: "paused" tasks are
because of the topology being paused, i.e. from KIP-834; whereas
"suspended" tasks are when a restoring tasks are being removed before
it completes due to a follow-up rebalance, and this is to distinguish
with "onRestoreEnd", as described in KAFKA-10575. A suspended task is
no longer owned by the thread and hence there's no need to measure the
number of such tasks.

re: "restore-ratio": that's a good point. I like it to function in the
same way as the "records-rate" metrics. Will update the wiki.

re: making "restore-remaining-records-total" at INFO level: sounds
good to me too. I will also update the metric name a bit to be more
specific.



On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
<guozhang.wang...@gmail.com> wrote:
>
> Hello Matthias,
>
> Thanks for the feedback. I was on vacation for a while. Pardon for the
> late replies. Please see them inline below
>
> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >
> > Seems I am late to the party... Great KIP. Couple of questions from my side:
> >
> > (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> > same as the number of assigned standby task? Not sure how useful it
> > would be?
> >
> In general, yes, it is the number of assigned standby tasks --- there
> will be transit times when the assigned standby tasks are not yet
> being updated but it would not last long --- but we do not yet have a
> direct gauge to expose this before, and users have to infer this from
> other indirect metrics.
>
> >
> >
> > (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> > exactly mean? There was a discussion about renaming the callback method
> > from pause to suspended. So should this be called `suspended`, too? And
> > if yes, how is it useful for users?
> >
> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> Topologies" 
> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> When a topology is paused, all its tasks including standbys will be
> paused too.
>
> I'm not aware of a discussion to rename the call name to "suspend" for
> KIP-834. Could you point me to the reference?
>
> >
> >
> > (3) `restore-ratio`: the description says
> >
> > > The fraction of time the thread spent on restoring active or standby tasks
> >
> > I find the term "restoring" does only apply to active tasks, but not to
> > standbys. Can we reword this?
> >
> Yeah I have been discussing this with others in the community a bit as
> well, but so far I have not been convinced of a better name than it.
> Some other alternatives being discussed but not win everyone's love is
> "restore-or-update-ratio", "process-ratio" (for the restore thread
> that means restoring or updating), and "io-ratio".
>
> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.
>
> >
> > (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
> >
> This is similar to the "io-calls-rate" in the selector classes, i.e.
> the number of "restore" function calls made. It's argurably a very
> low-level metrics but I included it since it could be useful in some
> debugging scenarios.
>
> >
> > (5) `restore-remaining-records-total` -- why is this a task metric?
> > Seems we could roll it up into a thread metric that we report at INFO
> > level (we could still have per-task DEBUG level metric for it in addition).
> >
> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".
>
> >
> > (6) What about "warmup tasks"? Internally, we treat them as standbys,
> > but it seems it's hard for users to reason about it in the scale-out
> > warm-up case. Would it be helpful (and possible) to report "warmup
> > progress" explicitly?
> >
> At the restore thread level, we cannot differentiate standby tasks
> from warmup tasks since the latter is created exactly just like the
> former. But I do agree this is an issue for visibility that worth
> addressing, I think another KIP would be needed to first consider
> distinguishing these two at the class level.
>
> >
> > -Matthias
> >
> >
> > On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> > > We need this!
> > >
> > > + 1 non binding
> > >
> > > Cheers,
> > > Lucas
> > >
> > > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <cado...@apache.org> wrote:
> > >>
> > >> Guozhang,
> > >>
> > >> Thanks for the KIP!
> > >>
> > >> +1 (binding)
> > >>
> > >> Best,
> > >> Bruno
> > >>
> > >> On 25.10.22 22:07, Walker Carlson wrote:
> > >>> +1 non binding
> > >>>
> > >>> Thanks for the kip!
> > >>>
> > >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vvcep...@apache.org> 
> > >>> wrote:
> > >>>
> > >>>> Thanks for the KIP, Guozhang!
> > >>>>
> > >>>> I'm +1 (binding)
> > >>>>
> > >>>> -John
> > >>>>
> > >>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> > >>>>> Can't wait!
> > >>>>> +1 (non-binding)
> > >>>>>
> > >>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <guozhang.wang...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hello all,
> > >>>>>>
> > >>>>>> I'd like to start a vote for the following KIP, aiming to improve 
> > >>>>>> Kafka
> > >>>>>> Stream's restoration visibility via new metrics and callback methods:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >>>>>>
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> -- Guozhang
> > >>>>>>
> > >>>>
> > >>>

Reply via email to