[ https://issues.apache.org/jira/browse/KAFKA-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909788#comment-17909788 ]
Rafał Sumisławski commented on KAFKA-18369: ------------------------------------------- Hi [~lucasbru] IMO changing it from average to value doesn't solve the problem. I was about to open a similar ticket about the `StreamThread` metrics, just didn't find the time yet. The explanation of the issue is different, but the end result is mostly the same. Let me explain this in the new ticket, and I'll link it here once it's open. > State updater's *-ratio metrics are incorrect > --------------------------------------------- > > Key: KAFKA-18369 > URL: https://issues.apache.org/jira/browse/KAFKA-18369 > Project: Kafka > Issue Type: Bug > Components: streams > Affects Versions: 3.8.1 > Reporter: Rafał Sumisławski > Priority: Major > > h2. Background > {{DefaultStateUpdater}} defines {{{}idle-ratio{}}}, > {{{}active-restore-ratio{}}}, {{{}standby-update-ratio{}}}, > {{checkpoint-ratio}} metrics here: > [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115] > These metrics are averages, that are supposed to indicate "The fraction of > time the thread spent on \{action}". But the metrics don't actually do that. > h2. Issue > Let me explain this with an example: > For simplicity's sake, let's consider the following example involving just > {{{}standby-update-ratio{}}}, {{checkpoint-ratio}} and ignoring the existence > of the other two metrics. > Let's say the thread did: > * {{999}} iterations with {{standby-update}} taking {{1ms}} in each > iteration and no {{checkpoint}} happening ({{{}0ms{}}}). > * {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}} > taking {{9000ms}} > The thread spent {{10s}} working, of which it spent {{1s}} on > {{standby-updates}} and {{9s}} on checkpoint, so the fraction of time it > spent on checkpoint (checkpoint-ratio) is {{{}~0.001 (0.1%){}}}. Or at least > that is what the metrics will say. I would instead argue that it spent > {{9s/10s == 0.9 == 90%}} on checkpoint. If you agree with my logic, then you > agree that this metrics is incorrect. > The problem is that the code computes a ratio for each iteration, and then > averages those ratios out, producing a number devoid of statistical meaning > or practical application. It ignores the fact that the one iteration that > took {{{}9s{}}}, should have a much higher weight than those quick 1ms > iterations. > {{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}} > h2. Solution > What we would like to see instead is either a ratio of average/total times, > not an average of ratios. I don't think this can be easily realised within > the existing metrics system. So instead, what I propose as a solution is to > report {{duration-total}} and/or {{duration-rate}} (with the unit of seconds > per second) metric for each of {{{}idle{}}}, {{{}active-restore{}}}, > {{{}standby-restore{}}}, {{{}checkpoint{}}}. The observers of these metrics, > when needed, could then derive the actual ratio of time spent on each > operation for example as {{{}checkpoint-ratio = checkpoint-duration-rate / > (idle-duration-rate + active-restore-duration-rate + > standby-restore-duration-rate + checkpoint-duration-rate){}}}. Or by > performing an analogical calculation on deltas of the {{total}} metrics. > I can submit a PR once there's an agreement on the correct way to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010)