Rafał Sumisławski created KAFKA-18369:
-----------------------------------------
Summary: State updater's `*-ratio` metrics are incorrect
Key: KAFKA-18369
URL: https://issues.apache.org/jira/browse/KAFKA-18369
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.8.1
Reporter: Rafał Sumisławski
h2. Background
{{DefaultStateUpdater}} defines {{idle-ratio}}, {{active-restore-ratio}},
{{standby-update-ratio}}, {{checkpoint-ratio}} metrics here:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115
These metrics are averages, that are supposed to indicate "The fraction of
time the thread spent on \{action}". But the metrics don't actually do that.
h2. Issue
Let me explain this with an example:
For simplicity's sake, let's consider the following example involving just
{{standby-update-ratio}}, {{checkpoint-ratio}} and ignoring the existence of
the other two metrics.
Let's say the thread did:
* {{999}} iterations with {{standby-update}} taking {{1ms}} in each iteration
and no {{checkpoint}} happening ({{0ms}}).
* {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}}
taking {{9000ms}}
The thread spent {{10s}} working, of which it spent {{1s}} on
{{standby-updates}} and {{9s}} on checkpoint, so the fraction of time it spent
on checkpoint (checkpoint-ratio) is {{~0.001 (0.1%)}}. Or at least that is what
the metrics will say. I would instead argue that it spent {{9s/10s == 0.9 ==
90%}} on checkpoint. If you agree with my logic, then you agree that this
metrics is incorrect.
The problem is that the code computes a ratio for each iteration, and then
averages those ratios out, producing a number devoid of statistical meaning or
practical application. It ignores the fact that the one iteration that took
{{9s}}, should have a much higher weight than those quick 1ms iterations.
{{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}}
h2. Solution
What we would like to see instead is either a ratio of average/total times, not
an average of ratios. I don't think this can be easily realised within the
existing metrics system. So instead, what I propose as a solution is to report
{{duration-total}} and/or {{duration-rate}} (with the unit of seconds per
second) metric for each of {{idle}}, {{active-restore}}, {{standby-restore}},
{{checkpoint}}. The observers of these metrics, when needed, could then derive
the actual ratio of time spent on each operation for example as
{{checkpoint-ratio = checkpoint-duration-rate / (idle-duration-rate +
active-restore-duration-rate + standby-restore-duration-rate +
checkpoint-duration-rate)}}. Or by performing an analogical calculation on
deltas of the {{total}} metrics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)