Hello Bruno, Thanks for your comments!
1. Regarding the metrics group name: originally I put "stream-state-metrics" as it's related to state store restorations, but after a second thought I think I agree with you that this is quite confusing and not right. About the metrics groups, right now I have two ideas: a) Still use the metric group name "stream-thread-metrics", but differentiate with the processing threads on the thread id. The pros is that we do not introduce a new group, the cons is that users who want to separate processing from restoration/updating in the future needs to do that on the thread id labels. b) Introduce a new group name, for example "stream-state-updater-metrics" and still have a thread-id label. We would be introducing a new group which still have a thread-id, which may sound a bit weird (maybe if we do that we could change the existing "stream-thread-metrics" into "stream-processor-metrics"). Right now I'm leaning towards b) and maybe in the future rename "thread-metrics" to "processor-metrics", LMK what do you think. 2. Regarding the metric names: today we may also pause a standby tasks, and hence I'm trying to differentiate updating standbys from paused standbys. Right now I'm thinking we can change "restoring-standby-tasks" to "updating-standby-tasks", and change all other metrics' "restore" (except the "restoring-active-tasks") to "state-update", a.k.a "state-update-ratio", "state-update-records-total", "updating-standby-tasks" etc. 3. Regarding the function name: yeah I think that's a valid concern, I can change it to "onRestoreSuspended" since "Aborted" may confuse people that previously called "onBatchRestored" are undone as part of the abortion. Guozhang On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <cado...@apache.org> wrote: > Hi Guozhang, > > Thanks for the KIP! I think this KIP is a really nice addition to better > understand what is going on in a Kafka Streams application. > > 1. > The metric names "paused-active-tasks" and "paused-standby-tasks" might > be a bit confusing since at least active tasks can be paused also > outside of restoration. > > 2. > Why is the type of the metrics "stream-state-metrics"? I would have > expected "stream-thread-metrics" as the type. > > 3. > Isn't the value of the metric "restoring-standby-tasks" simply the > number of standby tasks since standby tasks are basically always > updating (aka restoring)? > > 4. > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics > tailored to the upcoming state updater. They do not make much sense with > a stream thread. Would it be better to introduce a new metrics level > specifically for the state updater? > > 5. > Personally, I do not like to use the word "restoration" together with > standbys since restoration somehow implies that there is an offset for > which the active task is considered restored and active processing can > start. In other words, restoration is finite. Standby tasks rather > update continuously their states. They can be up-to-date or lagging. I > see that you could say "restored" instead of "up-to-date" and "not > restored" instead of "lagging", but IMO it does not describe well the > situation. That is a rather minor point. I just wanted to mention it. > > 6. > The name "onRestorePaused()" might be confusing since in Kafka Streams > users can also pause tasks. What about "onRestoreAborted()" or > "onRestoreSuspended"? > > Best, > Bruno > > > On 16.09.22 19:33, Guozhang Wang wrote: > > Hello everyone, > > > > I'd like to start a discussion for the following KIP, aiming to improve > > Kafka Stream's restoration visibility via new metrics and callback > methods: > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > > > > > > Thanks! > > -- Guozhang > > > -- -- Guozhang