Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Lianet M. Tue, 23 Jul 2024 14:19:26 -0700

Hey all,

Follow-up on Bruno's point BC2. I personally did not suggest
background-event-queue-time-max and background-event-queue-time-avg mainly
because in practice we only have 2 types of events flowing from the
background to the app thread: errorEvents and callbackEvents, (vs the many
api-triggered events that flow from the app thread to the background). If
we get deeper into those, the error events are actually very light (from
the processing point of view), and the callbacks events already have their
specific metrics (inherited from the legacy consumer). So that was my
reasoning for not jumping right away into metrics for the background events
queue.


That being said, I do like the symmetry Bruno mentions, and see the value
in having visibility on both queues. The app queue would probably be the
busiest but we need to consider that the background events set may grow as
we integrate other features like Queues (ex. with its ack commits
callbacks), so ok with me if we prefer to have a view on both queues.

Lianet


On Tue, Jul 23, 2024 at 4:04 PM Brenden Deluna <[email protected]>
wrote:

> Hi Bruno,
> Thank you for your comments.
>
> BC1. I think that is a great point, I was not aware that we could not add
> the metrics based on which protocol is being used. I will update the KIP to
> reflect that change.
>
> BC2. There is not any specific reason for this, really it has just never
> been suggested in this thread. I will add them to get some opinions on if
> those would be useful and will go with the group consensus.
>
> Thank you,
> Brenden
>
> On Tue, Jul 23, 2024 at 11:03 AM Bruno Cadonna <[email protected]> wrote:
>
> > Hi Brenden,
> >
> > BC1. In his first e-mail Andrew wrote "I would expect that the metrics
> > do not exist at all". I agree with him. I think it would be better to
> > not add those metrics at all if the CLASSIC protocol is used rather than
> > the metrics exist and are all constant 0. This should be possible by not
> > adding the metrics to the metrics registry if the CONSUMER protocol is
> > not used.
> >
> > BC2. Is there a specific reason you do not propose
> > background-event-queue-time-max and background-event-queue-time-avg? If
> > folk think those are not useful we do not need to add them. However, if
> > those are not useful, why is background-event-queue-size useful. I was
> > just wondering about the asymmetry between background-event-queue and
> > application-event-queue.
> >
> > Best,
> > Bruno
> >
> >
> >
> > On 7/19/24 9:14 PM, Brenden Deluna wrote:
> > > Hi Apoorv,
> > > Thank you for your comments, I will address each.
> > >
> > > AM1. I can see the usefulness in also having an
> > > 'application-event-queue-age-max' to get an idea of outliers and how
> they
> > > may be affecting the average metric. I will add that.
> > >
> > > AM2. I agree with you there, I think 'time' is a better descriptor here
> > > than 'age'. I will update those metric names as well.
> > >
> > > AM3. Similar to above comments, I will change the name of that metric
> to
> > be
> > > more consistent. And I think a max metric would also be useful here,
> > adding
> > > that.
> > >
> > > AM4. Yes, good catch there. Will update that as well.
> > >
> > > Thank you,
> > > Brenden
> > >
> > > On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal <
> [email protected]>
> > > wrote:
> > >
> > >> Hi Brendan,
> > >> Thanks for the KIP. The metrics are always helpful.
> > >>
> > >> AM1: Is `application-event-queue-age-avg` enough or do we require `
> > >> application-event-queue-age-max` as well to differentiate with
> outliers?
> > >>
> > >> AM2: The kafka producer defines metric `record-queue-time-avg` which
> > >> captures the time spent in the buffer. Do you think we should have a
> > >> similar name for `application-event-queue-age-avg` i.e. change to `
> > >> application-event-queue-time-avg`? Moreover other than similar naming,
> > >> `time` anyways seems more suitable than `age`, though minor. The
> `time`
> > >> usage is also aligned with the description of this metric.
> > >>
> > >> AM3: Metric `application-event-processing-time` says "the average
> time,
> > >> that the consumer network.....". Shall we have the `-avg` suffix in
> the
> > >> metric as we have defined for other metrics? Also do we require the
> max
> > >> metric as well for the same?
> > >>
> > >> AM4: Is the telemetry name for `unsent-requests-queue-size` intended
> > >> as `org.apache.kafka.consumer.unsent.requests.size`,
> > >> or it should be corrected to `
> > >> org.apache.kafka.consumer.unsent.requests.queue.size`?
> > >>
> > >> AM2:
> > >> Regards,
> > >> Apoorv Mittal
> > >> +44 7721681581
> > >>
> > >>
> > >> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
> > >> [email protected]>
> > >> wrote:
> > >>
> > >>> Hi Brenden,
> > >>> Thanks for the updates.
> > >>>
> > >>> AS4. I see that you’ve added `.ms` to a bunch of the metrics
> reflecting
> > >> the
> > >>> fact that they’re measured in milliseconds. However, I observe that
> > most
> > >>> metrics
> > >>> in Kafka that are measured in milliseconds, with some exceptions in
> > Kafka
> > >>> Connect
> > >>> and MirrorMaker do not follow this convention. I would tend to err on
> > the
> > >>> side of
> > >>> consistency with the existing metrics and not use `.ms`. However,
> > that’s
> > >>> just my
> > >>> opinion, so I’d be interested to know what other reviewers of the KIP
> > >>> think.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> > >>>> On 12 Jul 2024, at 20:11, Brenden Deluna
> <[email protected]
> > >>>
> > >>> wrote:
> > >>>>
> > >>>> Hey Lianet,
> > >>>>
> > >>>> Thank you for your suggestions and feedback!
> > >>>>
> > >>>>
> > >>>> LM1. This has now been addressed.
> > >>>>
> > >>>>
> > >>>> LM2. I think that would be a valuable addition to the current set of
> > >>>> metrics, I will get that added.
> > >>>>
> > >>>>
> > >>>> LM3. Again great idea, that would certainly be helpful. Will add
> that
> > >> as
> > >>>> well.
> > >>>>
> > >>>>
> > >>>> Let me know if you have any more suggestions!
> > >>>>
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Brenden
> > >>>>
> > >>>> On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna <
> [email protected]>
> > >>> wrote:
> > >>>>
> > >>>>> Hi Lucas,
> > >>>>>
> > >>>>> Thank you for the feedback! I have addressed your comments:
> > >>>>>
> > >>>>>
> > >>>>> LB1. Good catch there, I will update the names as needed.
> > >>>>>
> > >>>>>
> > >>>>> LB2. Good catch again! I will update the name to be more
> consistent.
> > >>>>>
> > >>>>>
> > >>>>> LB3. Thank you for pointing this out, I realized that all metric
> > >> values
> > >>>>> will actually be set to 0. I will specifiy this and explain why
> they
> > >>> will
> > >>>>> be 0.
> > >>>>>
> > >>>>>
> > >>>>> Nit: This metric is referring to the queue of unsent requests in
> the
> > >>>>> NetworkClientDelegate. For the metric descriptions I am trying to
> not
> > >>>>> include too much of the implementation details, hence the reason
> that
> > >>>>> description is quite short. I cannot think of other ways to
> describe
> > >> the
> > >>>>> metric without going deeper into the implementation, but please do
> > let
> > >>> me
> > >>>>> know if you have any ideas.
> > >>>>>
> > >>>>>
> > >>>>> Thank you,
> > >>>>>
> > >>>>> Brenden
> > >>>>>
> > >>>>> On Fri, Jul 12, 2024 at 1:27 PM Lianet M. <[email protected]>
> > wrote:
> > >>>>>
> > >>>>>> Hey Brenden, thanks for the KIP! Great to get more visibility into
> > >> the
> > >>> new
> > >>>>>> consumer.
> > >>>>>>
> > >>>>>> LM1. +1 on Lucas's suggestion for including the unit in the name,
> > >> seems
> > >>>>>> clearer and consistent (I do see several time metrics including
> ms)
> > >>>>>>
> > >>>>>> LM2. What about a new metric for application-event-queue-time-ms.
> It
> > >>> would
> > >>>>>> be a complement to the application-event-queue-size you're
> > proposing,
> > >>> and
> > >>>>>> it will tell us how long the events sit in the queue, waiting to
> be
> > >>>>>> processed (from the time the API call adds the event to the queue,
> > to
> > >>> the
> > >>>>>> time it's processed in the background thread). I find it would be
> > >> very
> > >>>>>> interesting.
> > >>>>>>
> > >>>>>> LM3. Thinking about the actual usage of
> > >>>>>> "time-between-network-thread-poll-xxx" metric, I imagine it would
> be
> > >>>>>> helpful to know more about what could be impacting it. As I see
> it,
> > >> the
> > >>>>>> network thread cadence could be mainly impacted by: 1- app event
> > >>>>>> processing
> > >>>>>> (generate requests), 2- network client poll (actual send/receive).
> > >> For
> > >>> 2,
> > >>>>>> the new consumer reuses the same component as the legacy one, but
> 1
> > >> is
> > >>>>>> specific to the new consumer, so what about a metric
> > >>>>>> for application-event-processing-time-ms (we could consider avg I
> > >> would
> > >>>>>> say). It would be the time that the network thread takes to
> process
> > >> all
> > >>>>>> available events on each run.
> > >>>>>>
> > >>>>>> Cheers!
> > >>>>>> Lianet
> > >>>>>>
> > >>>>>> On Fri, Jul 12, 2024 at 1:57 PM Lucas Brutschy
> > >>>>>> <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Hey Brenden,
> > >>>>>>>
> > >>>>>>> thanks for the KIP! These will be great to observe and debug the
> > >>>>>>> background thread of the new consumer.
> > >>>>>>>
> > >>>>>>> LB1. `time-between-network-thread-poll-max` → I see several
> similar
> > >>>>>>> metrics including the unit in the metric name (ms or us). We
> could
> > >>>>>>> consider this, although it's probably not strictly required.
> > >> However,
> > >>>>>>> at least the description should state the unit. Same for the
> `avg`
> > >>>>>>> version.
> > >>>>>>> LB2. `unsent-requests-size` → Naming sounds a bit like it's
> > >> referring
> > >>>>>>> to the size of the request. How about `unsent-request-queue-size`
> > or
> > >>>>>>> `unsent-request-count` or simply `unsent-requests`?
> > >>>>>>> LB3. "the proposed metrics below will be set to null or 0." →
> which
> > >>>>>>> one will be set to null and which ones will be set to 0, and why?
> > >>>>>>>
> > >>>>>>> nit: "The current number of unsent requests in the consumer
> > >> network" →
> > >>>>>>> Seems to be missing something?
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Lucas
> > >>>>>>>
> > >>>>>>> On Fri, Jul 12, 2024 at 7:28 PM Brenden Deluna
> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Andrew,
> > >>>>>>>> Thank you for the feedback and your question.
> > >>>>>>>>
> > >>>>>>>> AS1. Great idea, I will get that added.
> > >>>>>>>>
> > >>>>>>>> AS2. For unsent-events-age-max, age will be calculated once the
> > >> event
> > >>>>>> is
> > >>>>>>>> sent, so you are correct.
> > >>>>>>>>
> > >>>>>>>> AS3. I agree, I think that would be a helpful metric to add,
> thank
> > >>>>>> you! I
> > >>>>>>>> will get that added.
> > >>>>>>>>
> > >>>>>>>> Please let me know if you have any additional feedback,
> > >> suggestions,
> > >>>>>> or
> > >>>>>>>> questions.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Brenden
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 12, 2024 at 11:45 AM Andrew Schofield <
> > >>>>>>> [email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Brenden,
> > >>>>>>>>> Thanks for the KIP. It fills a gap in the metrics for the new
> > >>>>>> consumer
> > >>>>>>>>> nicely.
> > >>>>>>>>>
> > >>>>>>>>> AS1. If using the CLASSIC protocol, and thus the
> > >>>>>> LegacyKafkaConsumer,
> > >>>>>>>>> I would expect that the metrics do not exist at all. Maybe say
> > >>>>>>> something
> > >>>>>>>>> like
> > >>>>>>>>> “These metrics are for the new consumer implementation using
> the
> > >>>>>>>>> CONSUMER protocol”.
> > >>>>>>>>>
> > >>>>>>>>> AS2. For unsent-events-age-max, when is the age calculated? For
> > >>>>>>> example,
> > >>>>>>>>> is it calculated at the time that the unsent event is removed
> > from
> > >>>>>> the
> > >>>>>>>>> list and sent, or does the metric reflect unsent events which
> are
> > >>>>>> still
> > >>>>>>>>> enqueued? I suspect the former, but thought I’d check.
> > >>>>>>>>>
> > >>>>>>>>> AS3. I think that unsent-events-age-avg would also be
> interesting
> > >> to
> > >>>>>>>>> get an idea of how long unsent events tend to sit around before
> > >>>>>>> sending.
> > >>>>>>>>> Of course, the question from AS2 would also apply to the
> average.
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> Andrew
> > >>>>>>>>>
> > >>>>>>>>>> On 10 Jul 2024, at 17:44, Philip Nee
> <[email protected]
> > >
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi all,
> > >>>>>>>>>>
> > >>>>>>>>>> This is the link to the KIP document.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1068%3A+New+JMX+metrics+for+the+new+KafkaConsumer
> > >>>>>>>>>>
> > >>>>>>>>>> Any comment is appreciated,
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Jul 9, 2024 at 10:14 AM Brenden Deluna
> > >>>>>>>>> <[email protected]>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hello everyone,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I would like to start the discussion thread for KIP-1068.
> This
> > >>>>>> is a
> > >>>>>>>>>>> relatively small KIP, only proposing to add a couple of new
> > >>>>>> metrics.
> > >>>>>>>>>>>
> > >>>>>>>>>>> If you have any suggestions or feedback, let me know, it will
> > be
> > >>>>>>> much
> > >>>>>>>>>>> appreciated.
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Reply via email to