Hey Rohan,

Thanks for putting up the KIP. Maybe you could also briefly describe the context in the email itself (it is explained very well in the KIP doc, though).
I have a few meta and detailed thoughts after reading it.

1) I agree that the current ratio metrics are just point-in-time snapshots, and that more flexible metrics that let reporters calculate over window intervals are better. However, the proposed metrics assume today's thread->clients mapping, where each thread exclusively owns one main consumer, one restore consumer, one producer, and one admin client. This mapping may change in the future. Have you thought about how these metrics could be extended when, e.g., the embedded clients and stream threads are decoupled?

2) [This and all below are minor comments] The "flush-time-total" may be better as a producer client metric, named "flush-wait-time-total", than as a streams metric, though the streams-level "total-blocked" can still leverage it. Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may be better placed inside the producer and consumer clients respectively.

3) The doc was not very clear on how "thread-start-time" would be used together with total-blocked time when calculating streams utilization. Could you elaborate a bit more in the KIP?

4) For "txn-commit-time-total" specifically: besides producer.commitTxn, other txn-related calls may also block, including producer.beginTxn/abortTxn. I saw you mentioned "txn-begin-time-total" later in the doc but did not include it as a separate metric. Similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly?

5) Not a suggestion, but I just wanted to bring up that the producer-related metrics are a bit coarse compared with the consumer/admin client ones, since their IO mechanisms differ: for the producer, the caller thread does not do any IO, since it is all delegated to the background sender thread, while for the consumer/admin clients the caller thread still needs to do some IO, and hence the selector-level metrics make sense.
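To make 1) and 3) a bit more concrete, here is a rough sketch of one possible reading of how a reporter might combine a cumulative blocked-time total with "thread-start-time". Everything in it is an assumption for illustration: the class and method names are hypothetical, both the blocked total and the timestamps are assumed to be in milliseconds, and the KIP may well define the calculation differently.

```java
/** Hypothetical reporter-side helper; not part of Kafka or of the KIP. */
final class BlockedTimeRatio {

    private BlockedTimeRatio() {}

    /**
     * Fraction of wall-clock time spent blocked between two readings of a
     * cumulative blocked-time total (assumed here to be in milliseconds).
     */
    static double blockedRatio(long earlierWallClockMs, long earlierBlockedTotalMs,
                               long laterWallClockMs, long laterBlockedTotalMs) {
        long elapsedMs = laterWallClockMs - earlierWallClockMs;
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("later reading must come after the earlier one");
        }
        long blockedMs = laterBlockedTotalMs - earlierBlockedTotalMs;
        double ratio = (double) blockedMs / elapsedMs;
        // Clamp to [0, 1] so that clock skew or a counter reset cannot
        // produce a ratio outside the meaningful range.
        return Math.max(0.0, Math.min(1.0, ratio));
    }

    /**
     * Ratio over the whole thread lifetime: the thread start time acts as an
     * implicit first reading with a blocked total of zero, so a single
     * reading is enough.
     */
    static double blockedRatioSinceStart(long threadStartMs,
                                         long nowMs, long blockedTotalMs) {
        return blockedRatio(threadStartMs, 0L, nowMs, blockedTotalMs);
    }

    public static void main(String[] args) {
        // Two readings taken 10 seconds apart, with 3 seconds of extra blocked time.
        double windowed = blockedRatio(10_000L, 4_000L, 20_000L, 7_000L);
        double lifetime = blockedRatioSinceStart(0L, 20_000L, 7_000L);
        System.out.println("windowed blocked ratio: " + windowed);
        System.out.println("lifetime blocked ratio: " + lifetime);
        System.out.println("windowed utilization:   " + (1.0 - windowed));
    }
}
```

Under this reading, a reporter that keeps its previous reading can pick any window it likes, while the start time only matters for the first window or for a lifetime average. If the intended use is different, that is exactly what would be good to spell out in the KIP.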
On 5), off the top of my head I cannot think of a better measuring mechanism for producers either, especially for the txn-related calls. We may need to experiment and see whether the resulting ratio reflects the actual blocked time reasonably accurately.

Guozhang

On Mon, Jul 12, 2021 at 12:01 PM Rohan Desai <desai.p.ro...@gmail.com> wrote:

> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams

--
-- Guozhang