Re: [DISCUSS] KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Mickael Maison Wed, 24 Jul 2024 03:06:06 -0700

Hi,

1. The title is a bit misleading. The KIP proposes new metrics; JMX is
just one of the mechanisms for exporting them.
2. +1 to not registering the new metrics when using the classic consumer,
instead of setting them to 0. Similarly, I assume existing metrics that
don't apply to the new consumer are not registered?
3. At the moment this KIP is not changing any public APIs. What's the
plan to make AsyncKafkaConsumer public?
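
To make point 2 concrete, here is a minimal Java sketch of conditional
registration. It is not Kafka code: ConditionalMetricsSketch, its map-based
registry, and the GroupProtocol enum are hypothetical stand-ins, and
records-lag-max merely stands in for the metrics that both consumer
implementations share. Only the two new metric names come from this thread.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.DoubleSupplier;

// Hypothetical sketch, not Kafka's Metrics API: a map from metric name to a gauge.
final class ConditionalMetricsSketch {

    // Mirrors the two group protocols named in the thread.
    enum GroupProtocol { CLASSIC, CONSUMER }

    private final Map<String, DoubleSupplier> registry = new TreeMap<>();

    ConditionalMetricsSketch(GroupProtocol protocol,
                             Collection<?> applicationEvents,
                             Collection<?> unsentRequests) {
        // Stand-in for the metrics that both consumer implementations share.
        registry.put("records-lag-max", () -> 0.0);

        // The new metrics are registered only for the new consumer, so with
        // CLASSIC they are absent rather than present with a constant 0.
        if (protocol == GroupProtocol.CONSUMER) {
            registry.put("application-event-queue-size", () -> applicationEvents.size());
            registry.put("unsent-requests-queue-size", () -> unsentRequests.size());
        }
    }

    // Names of all metrics that were actually registered.
    Set<String> metricNames() {
        return registry.keySet();
    }

    // Returns null when the metric was never registered.
    Double value(String name) {
        DoubleSupplier gauge = registry.get(name);
        return gauge == null ? null : gauge.getAsDouble();
    }
}
```

With this shape, a monitoring tool attached to a classic consumer simply never
sees the new metric names, which is easier to interpret than a flat line at 0.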


Thanks,
Mickael



On Tue, Jul 23, 2024 at 6:03 PM Bruno Cadonna <cado...@apache.org> wrote:
>
> Hi Brenden,
>
> BC1. In his first e-mail Andrew wrote "I would expect that the metrics
> do not exist at all". I agree with him. I think it would be better not
> to add those metrics at all when the CLASSIC protocol is used, rather
> than have them exist with a constant value of 0. This should be possible
> by not adding the metrics to the metrics registry if the CONSUMER
> protocol is not used.
>
> BC2. Is there a specific reason you do not propose
> background-event-queue-time-max and background-event-queue-time-avg? If
> folks think those are not useful, we do not need to add them. However, if
> those are not useful, why is background-event-queue-size useful? I was
> just wondering about the asymmetry between background-event-queue and
> application-event-queue.
>
> Best,
> Bruno
>
>
>
> On 7/19/24 9:14 PM, Brenden Deluna wrote:
> > Hi Apoorv,
> > Thank you for your comments, I will address each.
> >
> > AM1. I can see the usefulness in also having an
> > 'application-event-queue-age-max' to get an idea of outliers and how they
> > may be affecting the average metric. I will add that.
> >
> > AM2. I agree with you there, I think 'time' is a better descriptor here
> > than 'age'. I will update those metric names as well.
> >
> > AM3. Similar to above comments, I will change the name of that metric to be
> > more consistent. And I think a max metric would also be useful here, adding
> > that.
> >
> > AM4. Yes, good catch there. Will update that as well.
> >
> > Thank you,
> > Brenden
> >
> > On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal <apoorvmitta...@gmail.com>
> > wrote:
> >
> >> Hi Brenden,
> >> Thanks for the KIP. The metrics are always helpful.
> >>
> >> AM1: Is `application-event-queue-age-avg` enough, or do we also need
> >> `application-event-queue-age-max` to distinguish outliers?
> >>
> >> AM2: The Kafka producer defines the metric `record-queue-time-avg`, which
> >> captures the time spent in the buffer. Do you think we should use a
> >> similar name for `application-event-queue-age-avg`, i.e. change it to
> >> `application-event-queue-time-avg`? Naming consistency aside, `time`
> >> seems more suitable than `age` anyway, though this is minor. Using `time`
> >> also aligns with the description of this metric.
> >>
> >> AM3: Metric `application-event-processing-time` says "the average time,
> >> that the consumer network.....". Shall we add the `-avg` suffix to the
> >> metric name, as we have for the other metrics? Also, do we need a max
> >> metric for it as well?
> >>
> >> AM4: Is the telemetry name for `unsent-requests-queue-size` intended to
> >> be `org.apache.kafka.consumer.unsent.requests.size`, or should it be
> >> corrected to `org.apache.kafka.consumer.unsent.requests.queue.size`?
> >>
> >> Regards,
> >> Apoorv Mittal
> >> +44 7721681581
> >>
> >>
> >> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
> >> andrew_schofi...@live.com>
> >> wrote:
> >>
> >>> Hi Brenden,
> >>> Thanks for the updates.
> >>>
> >>> AS4. I see that you’ve added `.ms` to a bunch of the metrics, reflecting
> >>> the fact that they’re measured in milliseconds. However, I observe that
> >>> most metrics in Kafka that are measured in milliseconds, with some
> >>> exceptions in Kafka Connect and MirrorMaker, do not follow this
> >>> convention. I would tend to err on the side of consistency with the
> >>> existing metrics and not use `.ms`. However, that’s just my opinion, so
> >>> I’d be interested to know what other reviewers of the KIP think.
> >>>
> >>> Thanks,
> >>> Andrew
> >>>
> >>>> On 12 Jul 2024, at 20:11, Brenden Deluna <bdel...@confluent.io.INVALID>
> >>>> wrote:
> >>>>
> >>>> Hey Lianet,
> >>>>
> >>>> Thank you for your suggestions and feedback!
> >>>>
> >>>>
> >>>> LM1. This has now been addressed.
> >>>>
> >>>>
> >>>> LM2. I think that would be a valuable addition to the current set of
> >>>> metrics. I will get that added.
> >>>>
> >>>>
> >>>> LM3. Again, great idea; that would certainly be helpful. I will add that
> >>>> as well.
> >>>>
> >>>>
> >>>> Let me know if you have any more suggestions!
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Brenden
> >>>>
> >>>> On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna <bdel...@confluent.io>
> >>> wrote:
> >>>>
> >>>>> Hi Lucas,
> >>>>>
> >>>>> Thank you for the feedback! I have addressed your comments:
> >>>>>
> >>>>>
> >>>>> LB1. Good catch there, I will update the names as needed.
> >>>>>
> >>>>>
> >>>>> LB2. Good catch again! I will update the name to be more consistent.
> >>>>>
> >>>>>
> >>>>> LB3. Thank you for pointing this out. I realized that all metric
> >>>>> values will actually be set to 0. I will specify this and explain why
> >>>>> they will be 0.
> >>>>>
> >>>>>
> >>>>> Nit: This metric refers to the queue of unsent requests in the
> >>>>> NetworkClientDelegate. For the metric descriptions I am trying not to
> >>>>> include too many implementation details, which is why that
> >>>>> description is quite short. I cannot think of other ways to describe
> >>>>> the metric without going deeper into the implementation, but please do
> >>>>> let me know if you have any ideas.
> >>>>>
> >>>>>
> >>>>> Thank you,
> >>>>>
> >>>>> Brenden
> >>>>>
> >>>>> On Fri, Jul 12, 2024 at 1:27 PM Lianet M. <liane...@gmail.com> wrote:
> >>>>>
> >>>>>> Hey Brenden, thanks for the KIP! Great to get more visibility into
> >>>>>> the new consumer.
> >>>>>>
> >>>>>> LM1. +1 on Lucas's suggestion to include the unit in the name; it
> >>>>>> seems clearer and consistent (I do see several time metrics that
> >>>>>> include ms).
> >>>>>>
> >>>>>> LM2. What about a new metric, application-event-queue-time-ms? It
> >>>>>> would complement the application-event-queue-size you're proposing,
> >>>>>> and it would tell us how long events sit in the queue waiting to be
> >>>>>> processed (from the time the API call adds the event to the queue to
> >>>>>> the time it's processed in the background thread). I think it would be
> >>>>>> very interesting.
> >>>>>>
> >>>>>> LM3. Thinking about the actual usage of the
> >>>>>> "time-between-network-thread-poll-xxx" metric, I imagine it would be
> >>>>>> helpful to know more about what could be impacting it. As I see it,
> >>>>>> the network thread cadence could be mainly impacted by: 1- app event
> >>>>>> processing (generate requests), 2- network client poll (actual
> >>>>>> send/receive). For 2, the new consumer reuses the same component as
> >>>>>> the legacy one, but 1 is specific to the new consumer, so what about a
> >>>>>> metric for application-event-processing-time-ms (we could consider avg,
> >>>>>> I would say)? It would be the time that the network thread takes to
> >>>>>> process all available events on each run.
> >>>>>>
> >>>>>> Cheers!
> >>>>>> Lianet
> >>>>>>
> >>>>>> On Fri, Jul 12, 2024 at 1:57 PM Lucas Brutschy
> >>>>>> <lbruts...@confluent.io.invalid> wrote:
> >>>>>>
> >>>>>>> Hey Brenden,
> >>>>>>>
> >>>>>>> thanks for the KIP! These will be great to observe and debug the
> >>>>>>> background thread of the new consumer.
> >>>>>>>
> >>>>>>> LB1. `time-between-network-thread-poll-max` → I see several similar
> >>>>>>> metrics including the unit in the metric name (ms or us). We could
> >>>>>>> consider this, although it's probably not strictly required. However,
> >>>>>>> at least the description should state the unit. Same for the `avg`
> >>>>>>> version.
> >>>>>>> LB2. `unsent-requests-size` → The name sounds a bit like it refers to
> >>>>>>> the size of the request. How about `unsent-request-queue-size`,
> >>>>>>> `unsent-request-count`, or simply `unsent-requests`?
> >>>>>>> LB3. "the proposed metrics below will be set to null or 0." → Which
> >>>>>>> ones will be set to null and which ones will be set to 0, and why?
> >>>>>>>
> >>>>>>> nit: "The current number of unsent requests in the consumer network" →
> >>>>>>> Seems to be missing something?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Lucas
> >>>>>>>
> >>>>>>> On Fri, Jul 12, 2024 at 7:28 PM Brenden Deluna
> >>>>>>> <bdel...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>> Hi Andrew,
> >>>>>>>> Thank you for the feedback and your question.
> >>>>>>>>
> >>>>>>>> AS1. Great idea, I will get that added.
> >>>>>>>>
> >>>>>>>> AS2. For unsent-events-age-max, age will be calculated once the
> >>>>>>>> event is sent, so you are correct.
> >>>>>>>>
> >>>>>>>> AS3. I agree, I think that would be a helpful metric to add, thank
> >>>>>>>> you! I will get that added.
> >>>>>>>>
> >>>>>>>> Please let me know if you have any additional feedback, suggestions,
> >>>>>>>> or questions.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Brenden
> >>>>>>>>
> >>>>>>>> On Fri, Jul 12, 2024 at 11:45 AM Andrew Schofield <
> >>>>>>> andrew_schofi...@live.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Brenden,
> >>>>>>>>> Thanks for the KIP. It fills a gap in the metrics for the new
> >>>>>>>>> consumer nicely.
> >>>>>>>>>
> >>>>>>>>> AS1. If using the CLASSIC protocol, and thus the LegacyKafkaConsumer,
> >>>>>>>>> I would expect that the metrics do not exist at all. Maybe say
> >>>>>>>>> something like “These metrics are for the new consumer implementation
> >>>>>>>>> using the CONSUMER protocol”.
> >>>>>>>>>
> >>>>>>>>> AS2. For unsent-events-age-max, when is the age calculated? For
> >>>>>>>>> example, is it calculated at the time that the unsent event is removed
> >>>>>>>>> from the list and sent, or does the metric reflect unsent events which
> >>>>>>>>> are still enqueued? I suspect the former, but thought I’d check.
> >>>>>>>>>
> >>>>>>>>> AS3. I think that unsent-events-age-avg would also be interesting to
> >>>>>>>>> get an idea of how long unsent events tend to sit around before
> >>>>>>>>> sending. Of course, the question from AS2 would also apply to the
> >>>>>>>>> average.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Andrew
> >>>>>>>>>
> >>>>>>>>>> On 10 Jul 2024, at 17:44, Philip Nee <p...@confluent.io.INVALID>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> This is the link to the KIP document.
> >>>>>>>>>>
> >>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1068%3A+New+JMX+metrics+for+the+new+KafkaConsumer
> >>>>>>>>>>
> >>>>>>>>>> Any comment is appreciated,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jul 9, 2024 at 10:14 AM Brenden Deluna
> >>>>>>>>> <bdel...@confluent.io.invalid>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I would like to start the discussion thread for KIP-1068. This
> >>>>>>>>>>> is a relatively small KIP, only proposing to add a couple of new
> >>>>>>>>>>> metrics.
> >>>>>>>>>>>
> >>>>>>>>>>> If you have any suggestions or feedback, let me know; it will be
> >>>>>>>>>>> much appreciated.
> >>>
> >>>
> >>>
> >>
> >
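
For readers following the queue-time discussion in the quoted thread (LM2, AM1,
AM2, AS2), here is a rough Java sketch of how an avg and max queue time could
be derived. It assumes the time is measured when an event is removed from the
queue, as Brenden confirmed for AS2. EventQueueTimeSketch and all of its
members are hypothetical names, not part of AsyncKafkaConsumer, and the sketch
keeps lifetime totals for brevity, whereas Kafka's metrics library computes
such stats over sample windows.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical sketch: a queue that tracks how long events wait before processing.
final class EventQueueTimeSketch {

    // An event together with the time at which it was enqueued.
    private static final class TimedEvent {
        final String name;
        final long enqueuedAtMs;

        TimedEvent(String name, long enqueuedAtMs) {
            this.name = name;
            this.enqueuedAtMs = enqueuedAtMs;
        }
    }

    private final Queue<TimedEvent> queue = new ArrayDeque<>();
    private final LongSupplier clockMs; // injected so the sketch is testable
    private long processedCount;
    private long totalWaitMs;
    private long maxWaitMs;

    EventQueueTimeSketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    // Called by the application thread: stamp the event on the way in.
    void add(String name) {
        queue.add(new TimedEvent(name, clockMs.getAsLong()));
    }

    // Called by the background thread: the wait is measured on removal.
    String poll() {
        TimedEvent event = queue.poll();
        if (event == null) {
            return null;
        }
        long waitedMs = clockMs.getAsLong() - event.enqueuedAtMs;
        processedCount++;
        totalWaitMs += waitedMs;
        maxWaitMs = Math.max(maxWaitMs, waitedMs);
        return event.name;
    }

    // Corresponds to the idea behind application-event-queue-size.
    int queueSize() {
        return queue.size();
    }

    // Corresponds to the idea behind application-event-queue-time-avg.
    double queueTimeAvgMs() {
        return processedCount == 0 ? 0.0 : (double) totalWaitMs / processedCount;
    }

    // Corresponds to the idea behind application-event-queue-time-max.
    long queueTimeMaxMs() {
        return maxWaitMs;
    }
}
```

One consequence of measuring on removal is that an event stuck in the queue
does not show up in the avg or max until it is finally processed, so the size
metric remains the earlier signal of a stalled background thread.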
