[jira] [Updated] (KAFKA-18369) State updater's *-ratio metrics are incorrect

Jira Sun, 29 Dec 2024 11:16:06 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rafał Sumisławski updated KAFKA-18369:
--------------------------------------
    Description: 
h2. Background

{{DefaultStateUpdater}} defines {{{}idle-ratio{}}}, 
{{{}active-restore-ratio{}}}, {{{}standby-update-ratio{}}}, 
{{checkpoint-ratio}} metrics here: 
[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115]
 These metrics are averages, that are supposed to indicate "The fraction of 
time the thread spent on \{action}". But the metrics don't actually do that.
h2. Issue

Let me explain this with an example:

For simplicity's sake, let's consider the following example involving just 
{{{}standby-update-ratio{}}}, {{checkpoint-ratio}} and ignoring the existence 
of the other two metrics.
Let's say the thread did:
 * {{999}} iterations with {{standby-update}} taking {{1ms}} in each iteration 
and no {{checkpoint}} happening ({{{}0ms{}}}).
 * {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}} 
taking {{9000ms}}

The thread spent {{10s}} working, of which it spent {{1s}} on 
{{standby-updates}} and {{9s}} on checkpoint, so the fraction of time it spent 
on checkpoint (checkpoint-ratio) is {{{}~0.001 (0.1%){}}}. Or at least that is 
what the metrics will say. I would instead argue that it spent {{9s/10s == 0.9 
== 90%}} on checkpoint. If you agree with my logic, then you agree that this 
metrics is incorrect.

The problem is that the code computes a ratio for each iteration, and then 
averages those ratios out, producing a number devoid of statistical meaning or 
practical application. It ignores the fact that the one iteration that took 
{{{}9s{}}}, should have a much higher weight than those quick 1ms iterations.

{{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}}
h2. Solution

What we would like to see instead is either a ratio of average/total times, not 
an average of ratios. I don't think this can be easily realised within the 
existing metrics system. So instead, what I propose as a solution is to report 
{{duration-total}} and/or {{duration-rate}} (with the unit of seconds per 
second) metric for each of {{{}idle{}}}, {{{}active-restore{}}}, 
{{{}standby-restore{}}}, {{{}checkpoint{}}}. The observers of these metrics, 
when needed, could then derive the actual ratio of time spent on each operation 
for example as {{{}checkpoint-ratio = checkpoint-duration-rate / 
(idle-duration-rate + active-restore-duration-rate + 
standby-restore-duration-rate + checkpoint-duration-rate){}}}. Or by performing 
an analogical calculation on deltas of the {{total}} metrics.

I can submit a PR once there's an agreement on the correct way to fix this.

  was:
h2. Background

{{DefaultStateUpdater}} defines {{idle-ratio}}, {{active-restore-ratio}}, 
{{standby-update-ratio}}, {{checkpoint-ratio}} metrics here: 
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115
 These metrics are averages, that are supposed to indicate "The fraction of 
time the thread spent on \{action}". But the metrics don't actually do that.

h2. Issue

Let me explain this with an example:

For simplicity's sake, let's consider the following example involving just 
{{standby-update-ratio}}, {{checkpoint-ratio}} and ignoring the existence of 
the other two metrics.
Let's say the thread did:
* {{999}} iterations with {{standby-update}} taking {{1ms}} in each iteration 
and no {{checkpoint}} happening ({{0ms}}).
* {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}} 
taking {{9000ms}}

The thread spent {{10s}} working, of which it spent {{1s}} on 
{{standby-updates}} and {{9s}} on checkpoint, so the fraction of time it spent 
on checkpoint (checkpoint-ratio) is {{~0.001 (0.1%)}}. Or at least that is what 
the metrics will say. I would instead argue that it spent {{9s/10s == 0.9 == 
90%}} on checkpoint. If you agree with my logic, then you agree that this 
metrics is incorrect.

The problem is that the code computes a ratio for each iteration, and then 
averages those ratios out, producing a number devoid of statistical meaning or 
practical application. It ignores the fact that the one iteration that took 
{{9s}}, should have a much higher weight than those quick 1ms iterations.

{{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}}

h2. Solution

What we would like to see instead is either a ratio of average/total times, not 
an average of ratios. I don't think this can be easily realised within the 
existing metrics system. So instead, what I propose as a solution is to report 
{{duration-total}} and/or {{duration-rate}} (with the unit of seconds per 
second) metric for each of {{idle}}, {{active-restore}}, {{standby-restore}}, 
{{checkpoint}}. The observers of these metrics, when needed, could then derive 
the actual ratio of time spent on each operation for example as 
{{checkpoint-ratio = checkpoint-duration-rate / (idle-duration-rate + 
active-restore-duration-rate + standby-restore-duration-rate + 
checkpoint-duration-rate)}}. Or by performing an analogical calculation on 
deltas of the {{total}} metrics.


> State updater's *-ratio metrics are incorrect
> ---------------------------------------------
>
>                 Key: KAFKA-18369
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18369
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.1
>            Reporter: Rafał Sumisławski
>            Priority: Major
>
> h2. Background
> {{DefaultStateUpdater}} defines {{{}idle-ratio{}}}, 
> {{{}active-restore-ratio{}}}, {{{}standby-update-ratio{}}}, 
> {{checkpoint-ratio}} metrics here: 
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115]
>  These metrics are averages, that are supposed to indicate "The fraction of 
> time the thread spent on \{action}". But the metrics don't actually do that.
> h2. Issue
> Let me explain this with an example:
> For simplicity's sake, let's consider the following example involving just 
> {{{}standby-update-ratio{}}}, {{checkpoint-ratio}} and ignoring the existence 
> of the other two metrics.
> Let's say the thread did:
>  * {{999}} iterations with {{standby-update}} taking {{1ms}} in each 
> iteration and no {{checkpoint}} happening ({{{}0ms{}}}).
>  * {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}} 
> taking {{9000ms}}
> The thread spent {{10s}} working, of which it spent {{1s}} on 
> {{standby-updates}} and {{9s}} on checkpoint, so the fraction of time it 
> spent on checkpoint (checkpoint-ratio) is {{{}~0.001 (0.1%){}}}. Or at least 
> that is what the metrics will say. I would instead argue that it spent 
> {{9s/10s == 0.9 == 90%}} on checkpoint. If you agree with my logic, then you 
> agree that this metrics is incorrect.
> The problem is that the code computes a ratio for each iteration, and then 
> averages those ratios out, producing a number devoid of statistical meaning 
> or practical application. It ignores the fact that the one iteration that 
> took {{{}9s{}}}, should have a much higher weight than those quick 1ms 
> iterations.
> {{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}}
> h2. Solution
> What we would like to see instead is either a ratio of average/total times, 
> not an average of ratios. I don't think this can be easily realised within 
> the existing metrics system. So instead, what I propose as a solution is to 
> report {{duration-total}} and/or {{duration-rate}} (with the unit of seconds 
> per second) metric for each of {{{}idle{}}}, {{{}active-restore{}}}, 
> {{{}standby-restore{}}}, {{{}checkpoint{}}}. The observers of these metrics, 
> when needed, could then derive the actual ratio of time spent on each 
> operation for example as {{{}checkpoint-ratio = checkpoint-duration-rate / 
> (idle-duration-rate + active-restore-duration-rate + 
> standby-restore-duration-rate + checkpoint-duration-rate){}}}. Or by 
> performing an analogical calculation on deltas of the {{total}} metrics.
> I can submit a PR once there's an agreement on the correct way to fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-18369) State updater's *-ratio metrics are incorrect

Reply via email to