Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Bruno Cadonna Thu, 22 Sep 2022 05:21:52 -0700

Hi Guozhang!

Thanks for the updates!

1.

Why do you distinguish between active and standby for the total amountof restored/updated records but not for the rate of restored/updatedrecords?

2.

Regarding "standby-records-remaining" I am not sure how useful thismetric is and I am not sure how hard it will be to record. I see theusefulness of monitoring the lag of a single standby state store, but Iam not sure how useful it is to monitor the "lag" of a state updaterthread that might potentially contain multiple state stores.Furthermore, do we need to issue a remote call to record this metric ordo we get this information from each poll?

3.

Could you please give an example where "active-records-restored-total"and "standby-records-updated-total" are useful?


Best,
Bruno

On 20.09.22 22:45, Guozhang Wang wrote:

Hello Bruno/Nick,

I've updated the KIP wiki to reflect the incorporated comments from you,
please feel free to take another look and let me know what you think.


Guozhang

On Tue, Sep 20, 2022 at 9:37 AM Guozhang Wang <wangg...@gmail.com> wrote:

Hi Nick,

Thanks for the reviews, and I think these are good suggestions. Note that
currently the `restore-records-total/rate` would include both restoring
active tasks as well as updating standby tasks, I think for your purposes
you'd be more interested in active restoring tasks since they could
complete, while updating standby tasks would not complete even if they have
caught up with the active. At the same time, the changelog reader would
only be restoring either active or standby at a given time, and active
tasks has a higher priority such that as long as there is at least one
active task still restoring, we would not restore any standby tasks. From
your suggestion, I'm thinking that maybe I should break up the rate / total
metric for active and standby tasks separately.

For deriving estimated time remaining though, the `total` metric may not
be helpful since they will not be "reset" after rebalances, i.e. they will
be an ever-increasing number and record the total number of records for the
lifetime of the app. But still, just the remaining records alone, with the
time elapsed monitored by the apps, should be sufficient to get the
estimated time remaining.


Guozhang


On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <nick.telf...@gmail.com>
wrote:

Hi Guozhang,

KIP looks great, I have one suggestion: in addition to
"restore-records-total", it would also be useful to track the number of
records *remaining*, that is, the records that have not yet been restored.
This is actually the metric I was attempting to implement in the
StateRestoreListener that bumped me up against KAFKA-10575 :-)

With both a "restore-records-total" and a "restore-remaining-total" (or
similar) metric, it's possible to derive useful information like the
estimated time remaining for restoration (by dividing the remaining total
by the restoration rate).

Regards,

Nick

On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wangg...@gmail.com> wrote:

Hello Bruno,

Thanks for your comments!

1. Regarding the metrics group name: originally I put
"stream-state-metrics" as it's related to state store restorations, but
after a second thought I think I agree with you that this is quite
confusing and not right. About the metrics groups, right now I have two
ideas:

a) Still use the metric group name "stream-thread-metrics", but
differentiate with the processing threads on the thread id. The pros is
that we do not introduce a new group, the cons is that users who want to
separate processing from restoration/updating in the future needs to do
that on the thread id labels.
b) Introduce a new group name, for example

"stream-state-updater-metrics"

and still have a thread-id label. We would be introducing a new group

which

still have a thread-id, which may sound a bit weird (maybe if we do

that we

could change the existing "stream-thread-metrics" into
"stream-processor-metrics").

Right now I'm leaning towards b) and maybe in the future rename
"thread-metrics" to "processor-metrics", LMK what do you think.

2. Regarding the metric names: today we may also pause a standby tasks,

and

hence I'm trying to differentiate updating standbys from paused

standbys.

Right now I'm thinking we can change "restoring-standby-tasks" to
"updating-standby-tasks", and change all other metrics' "restore"

(except

the "restoring-active-tasks") to "state-update", a.k.a
"state-update-ratio", "state-update-records-total",
"updating-standby-tasks" etc.

3. Regarding the function name: yeah I think that's a valid concern, I

can

change it to "onRestoreSuspended" since "Aborted" may confuse people

that

previously called "onBatchRestored" are undone as part of the abortion.


Guozhang



On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <cado...@apache.org>

wrote:

Hi Guozhang,

Thanks for the KIP! I think this KIP is a really nice addition to

better

understand what is going on in a Kafka Streams application.

1.
The metric names "paused-active-tasks" and "paused-standby-tasks"

might

be a bit confusing since at least active tasks can be paused also
outside of restoration.

2.
Why is the type of the metrics "stream-state-metrics"? I would have
expected "stream-thread-metrics" as the type.

3.
Isn't the value of the metric "restoring-standby-tasks" simply the
number of standby tasks since standby tasks are basically always
updating (aka restoring)?

4.
"idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
tailored to the upcoming state updater. They do not make much sense

with

a stream thread. Would it be better to introduce a new metrics level
specifically for the state updater?

5.
Personally, I do not like to use the word "restoration" together with
standbys since restoration somehow implies that there is an offset for
which the active task is considered restored and active processing can
start. In other words, restoration is finite. Standby tasks rather
update continuously their states. They can be up-to-date or lagging. I
see that you could say "restored" instead of "up-to-date" and "not
restored" instead of "lagging", but IMO it does not describe well the
situation. That is a rather minor point. I just wanted to mention it.

6.
The name "onRestorePaused()" might be confusing since in Kafka Streams
users can also pause tasks. What about "onRestoreAborted()" or
"onRestoreSuspended"?

Best,
Bruno


On 16.09.22 19:33, Guozhang Wang wrote:

Hello everyone,

I'd like to start a discussion for the following KIP, aiming to

improve

Kafka Stream's restoration visibility via new metrics and callback

methods:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility



Thanks!
-- Guozhang



--
-- Guozhang



--
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Reply via email to