Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Guozhang Wang Mon, 12 Jul 2021 22:28:30 -0700

Hey Rohan,

Thanks for putting up the KIP. It would also help to briefly describe the
context in the email message itself (although it is very well explained
inside the KIP doc).

I have a few meta and detailed thoughts after reading it.

1) I agree that the current ratio metrics are just point-in-time snapshots,
and that more flexible metrics that let reporters compute ratios over window
intervals are better. However, the current mechanism of the proposed
metrics assumes the thread->clients mapping as of today, where each thread
would own exclusively one main consumer, restore consumer, producer and an
admin client. But this mapping may be subject to change in the future. Have
you thought about how this metric can be extended when, e.g. the embedded
clients and stream threads are de-coupled?
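
To make the "window intervals" point concrete, here is a minimal sketch of
how a reporter might derive a blocked ratio for an arbitrary window from two
samples of a cumulative total. The class and method names are hypothetical,
and the millisecond units are an assumption for illustration, not something
the KIP specifies.

```java
/** Hypothetical reporter-side helper; not part of Kafka or the KIP. */
public final class BlockedTimeWindow {

    /** One reading of the cumulative metric (units assumed to be ms). */
    public static final class Sample {
        final long wallClockMs;     // when the reading was taken
        final long blockedTotalMs;  // cumulative blocked time at that moment

        public Sample(long wallClockMs, long blockedTotalMs) {
            this.wallClockMs = wallClockMs;
            this.blockedTotalMs = blockedTotalMs;
        }
    }

    /** Fraction of the window between the two samples spent blocked. */
    public static double blockedRatio(Sample earlier, Sample later) {
        long elapsed = later.wallClockMs - earlier.wallClockMs;
        if (elapsed <= 0) {
            throw new IllegalArgumentException("samples must be ordered in time");
        }
        long blocked = later.blockedTotalMs - earlier.blockedTotalMs;
        return (double) blocked / elapsed;
    }
}
```

Because the total only ever grows, the reporter is free to choose any window
length simply by choosing which two samples to subtract.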

2) [This and all below are minor comments] The "flush-time-total" may
better be a producer client metric, as "flush-wait-time-total", than a
streams metric, though the streams-level "total-blocked" can still leverage
it. Similarly, I think "txn-commit-time-total" and
"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

3) The doc was not very clear on how "thread-start-time" would be needed
when calculating streams utilization along with total-blocked time. Could
you elaborate a bit more in the KIP?
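
One possible reading, sketched here purely as an assumption since the doc
does not spell it out: the start time gives the denominator for a
since-startup utilization, with the cumulative blocked time as the
numerator of the idle fraction. The names and the millisecond units below
are hypothetical.

```java
/** Hypothetical helper; an assumed interpretation, not the KIP's definition. */
public final class ThreadUtilization {

    /**
     * Utilization since the thread started, assuming
     * utilization = 1 - blocked / (now - threadStart).
     */
    public static double sinceStart(long threadStartMs, long nowMs, long blockedTotalMs) {
        long aliveMs = nowMs - threadStartMs;
        if (aliveMs <= 0) {
            throw new IllegalArgumentException("now must be after thread start");
        }
        return 1.0 - (double) blockedTotalMs / aliveMs;
    }
}
```

If this is the intended use, a reporter that already samples the total
periodically would only need the start time for its very first window.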

4) For "txn-commit-time-total" specifically, besides producer.commitTxn,
other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn. I saw you mentioned "txn-begin-time-total"
later in the doc, but did not include it as a separate metric. Similarly,
should we have a `txn-abort-time-total` as well? If yes, could you update
the KIP page accordingly?
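
As a rough illustration of what a per-call "time-total" metric amounts to,
the sketch below accumulates the time the caller thread spends inside a
blocking call. `BlockedTimer` is a hypothetical helper, not an existing
Kafka class, and the injectable clock is there only to make the behavior
easy to check.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical accumulator for time spent in blocking calls. */
public final class BlockedTimer {
    private final AtomicLong totalNs = new AtomicLong();
    private final LongSupplier nanoClock;

    /** In production the clock would typically be System::nanoTime. */
    public BlockedTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the call and adds its duration to the total, even if it throws. */
    public void time(Runnable blockingCall) {
        long start = nanoClock.getAsLong();
        try {
            blockingCall.run();
        } finally {
            totalNs.addAndGet(nanoClock.getAsLong() - start);
        }
    }

    public long totalNs() {
        return totalNs.get();
    }
}
```

One timer per call type (begin, commit, abort) would yield the separate
totals discussed above, e.g. `commitTimer.time(producer::commitTransaction)`
on a `KafkaProducer`.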

5) Not a suggestion, but I just wanted to bring up that the producer-related
metrics are a bit "coarse" compared with the consumer/admin clients, since
their IO mechanisms are a bit different: for the producer, the caller thread
does not do any IO since it's all delegated to the background sender
thread, while for consumer/admin the caller thread still needs to do
some IO, and hence the selector-level metrics make sense. Off the top of
my head I cannot think of a better measuring mechanism for producers
either, especially for the txn-related ones. We may need to experiment and
see whether the generated ratio reflects the actual "block time" with
reasonable accuracy.

Guozhang

On Mon, Jul 12, 2021 at 12:01 PM Rohan Desai <desai.p.ro...@gmail.com>
wrote:

>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>

-- 
-- Guozhang
