Thanks Guozhang. Couple of clarifications and follow up questions.


I'm not aware of a discussion to rename the call name to "suspend" for
KIP-834. Could you point me to the reference?

My commend was not about KIP-834, but about this KIP. You originally proposed to call the new call-back `onRestorePause` but to avoid confusion it was improved and renamed to `onRestoreSuspended`.


The only one so far that I feel is probably better, is
"state-update-ratio". If folks feel this one is better than
"restore-ratio" I'm happy to update.

Could we actually report two metric, one for the restore phase (restore-ration), and one for steady state ([standby-]update-ratio)?

I could like with `state-update-ratio` if we want to have a single metric for both, but splitting them sound useful to me.


(4) `restore-call-rate`

Maybe we can clarify in the description a little bit. I agree it's very low level but if you think it could be useful to debugging, I have no objection.


The rationale behind it is the general principle in metrics design
that "Kafka would provide the lowest necessary metrics levels, and
users can do the roll-ups however they want".

That's fair, but it seems to be a rather important metric, and having it only at DEBUG level seems not ideal? Could we make it INFO level, even if it's a task level (ie, apply an exception to the rule).



-Matthias



On 1/19/23 2:35 PM, Guozhang Wang wrote:
Hello Matthias,

Thanks for the feedback. I was on vacation for a while. Pardon for the
late replies. Please see them inline below

On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:

Seems I am late to the party... Great KIP. Couple of questions from my side:

(1) What is the purpose of `standby-updating-tasks`? It seems to be the
same as the number of assigned standby task? Not sure how useful it
would be?

In general, yes, it is the number of assigned standby tasks --- there
will be transit times when the assigned standby tasks are not yet
being updated but it would not last long --- but we do not yet have a
direct gauge to expose this before, and users have to infer this from
other indirect metrics.



(2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
exactly mean? There was a discussion about renaming the callback method
from pause to suspended. So should this be called `suspended`, too? And
if yes, how is it useful for users?

Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
Topologies" 
(https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
When a topology is paused, all its tasks including standbys will be
paused too.

I'm not aware of a discussion to rename the call name to "suspend" for
KIP-834. Could you point me to the reference?



(3) `restore-ratio`: the description says

The fraction of time the thread spent on restoring active or standby tasks

I find the term "restoring" does only apply to active tasks, but not to
standbys. Can we reword this?

Yeah I have been discussing this with others in the community a bit as
well, but so far I have not been convinced of a better name than it.
Some other alternatives being discussed but not win everyone's love is
"restore-or-update-ratio", "process-ratio" (for the restore thread
that means restoring or updating), and "io-ratio".

The only one so far that I feel is probably better, is
"state-update-ratio". If folks feel this one is better than
"restore-ratio" I'm happy to update.


(4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?

This is similar to the "io-calls-rate" in the selector classes, i.e.
the number of "restore" function calls made. It's argurably a very
low-level metrics but I included it since it could be useful in some
debugging scenarios.


(5) `restore-remaining-records-total` -- why is this a task metric?
Seems we could roll it up into a thread metric that we report at INFO
level (we could still have per-task DEBUG level metric for it in addition).

The rationale behind it is the general principle in metrics design
that "Kafka would provide the lowest necessary metrics levels, and
users can do the roll-ups however they want".


(6) What about "warmup tasks"? Internally, we treat them as standbys,
but it seems it's hard for users to reason about it in the scale-out
warm-up case. Would it be helpful (and possible) to report "warmup
progress" explicitly?

At the restore thread level, we cannot differentiate standby tasks
from warmup tasks since the latter is created exactly just like the
former. But I do agree this is an issue for visibility that worth
addressing, I think another KIP would be needed to first consider
distinguishing these two at the class level.


-Matthias


On 11/1/22 2:44 AM, Lucas Brutschy wrote:
We need this!

+ 1 non binding

Cheers,
Lucas

On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <cado...@apache.org> wrote:

Guozhang,

Thanks for the KIP!

+1 (binding)

Best,
Bruno

On 25.10.22 22:07, Walker Carlson wrote:
+1 non binding

Thanks for the kip!

On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vvcep...@apache.org> wrote:

Thanks for the KIP, Guozhang!

I'm +1 (binding)

-John

On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
Can't wait!
+1 (non-binding)

On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <guozhang.wang...@gmail.com>
wrote:

Hello all,

I'd like to start a vote for the following KIP, aiming to improve Kafka
Stream's restoration visibility via new metrics and callback methods:



https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility


Thanks!
-- Guozhang



Reply via email to