Thanks for the review Guozhang! responding to your feedback inline:

> 1) I agree that the current ratio metrics is just "snapshot in point", and
more flexible metrics that would allow reporters to calculate based on
window intervals are better. However, the current mechanism of the proposed
metrics assumes the thread->clients mapping as of today, where each thread
would own exclusively one main consumer, restore consumer, producer and an
admin client. But this mapping may be subject to change in the future. Have
you thought about how this metric can be extended when, e.g. the embedded
clients and stream threads are de-coupled?

Of course this depends on how exactly we refactor the runtime - assuming
that we plan to factor out consumers into an "I/O" layer that is
responsible for receiving records and enqueuing them to be processed by
processing threads, then I think it should be reasonable to count the time
we spend blocked on this internal queue(s) as blocked. The main concern
there to me is that the I/O layer would be doing something expensive like
decompression that shouldn't be counted as "blocked". But if that really is
so expensive that it starts to throw off our ratios then it's probably
indicative of a larger problem that the "i/o layer" is a bottleneck and it
would be worth refactoring so that decompression (or insert other expensive
thing here) can also be done on the processing threads.

> 2) [This and all below are minor comments] The "flush-time-total" may
better be a producer client metric, as "flush-wait-time-total", than a
streams metric, though the streams-level "total-blocked" can still leverage
it. Similarly, I think "txn-commit-time-total" and
"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

Good call - I'll update the KIP

> 3) The doc was not very clear on how "thread-start-time" would be needed
when calculating streams utilization along with total-blocked time, could
you elaborate a bit more in the KIP?

Yes, will do.

> For "txn-commit-time-total" specifically, besides producer.commitTxn.
other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
later in the doc, but did not include it as a separate metric, and
similarly, should we have a `txn-abort-time-total` as well? If yes, could
you update the KIP page accordingly.

Ack.

On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <desai.p.ro...@gmail.com>
wrote:

> Hello All,
>
> I'd like to start a discussion on the KIP linked above which proposes some
> metrics that we would find useful to help measure whether a Kafka Streams
> application is saturated. The motivation section in the KIP goes into some
> more detail on why we think this is a useful addition to the metrics
> already implemented. Thanks in advance for your feedback!
>
> Best Regards,
>
> Rohan
>
> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <desai.p.ro...@gmail.com>
> wrote:
>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>
>

Reply via email to