Re: [VOTE] KIP-714: Client metrics and observability

Philip Nee Wed, 13 Sep 2023 09:50:40 -0700

Hey Andrew -

Thank you for taking the time to reply to my questions. I'm just adding
some notes to this discussion.


1. Epoch: It can be helpful to know the delta between the client-side
leader epoch and the actual leader epoch. That helps explain why a commit
sometimes fails or why the client is not making progress.
2. Client connection: If the client selects the "wrong" connection to push
out the data, I assume the request would time out, which should lead to
disconnecting from that node and selecting another one via the
least-loaded-node mechanism, as you mentioned.
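
To make point 2 concrete, here is a minimal sketch of the behaviour being
assumed: stick with one node for consecutive pushes, and on a timeout drop
that connection and pick the least-loaded remaining node. The names `Node`,
`TelemetrySender`, `least_loaded`, and `on_timeout` are hypothetical and
only illustrate the idea; this is a simplified model, not the Kafka
client's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    in_flight: int          # outstanding requests on this connection
    connected: bool = True


def least_loaded(nodes: List[Node]) -> Optional[Node]:
    """Return the connected node with the fewest in-flight requests."""
    candidates = [n for n in nodes if n.connected]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.in_flight)


class TelemetrySender:
    """Hypothetical helper: sticky node choice with reselection on timeout."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        self.current: Optional[Node] = None

    def target(self) -> Optional[Node]:
        # Keep using the same node while its connection is still up.
        if self.current is None or not self.current.connected:
            self.current = least_loaded(self.nodes)
        return self.current

    def on_timeout(self) -> Optional[Node]:
        # A push timed out: treat the connection as dead and reselect.
        if self.current is not None:
            self.current.connected = False
        self.current = None
        return self.target()


nodes = [Node(1, in_flight=5), Node(2, in_flight=1), Node(3, in_flight=3)]
sender = TelemetrySender(nodes)
first = sender.target()         # least loaded at selection time
nodes[1].in_flight = 10         # the chosen node becomes busy afterwards
sticky = sender.target()        # still the same node (sticky)
fallback = sender.on_timeout()  # disconnect, then reselect
```

In this model a node that becomes congested after selection is only
replaced once a push actually times out, which is the situation raised in
the original question.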

Cheers,
P


On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
[email protected]> wrote:

> Hi Philip,
> Thanks for your vote and interest in the KIP.
>
> KIP-714 does not introduce any new client metrics, and that’s intentional.
> It does describe how all of the client metrics can have their names
> transformed into equivalent “telemetry metric names”, which can then
> potentially be used in metrics subscriptions.
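
As an illustration of the name transformation mentioned above, here is a
minimal sketch. The rule it applies (prefix `org.apache.kafka.`, strip a
trailing `-metrics` from the group, replace hyphens with dots) is an
assumption about the KIP's naming scheme and is not stated in this thread;
the function name and the `consumer-coordinator-metrics` group used in the
example are likewise illustrative. Check the KIP text for the authoritative
rule.

```python
PREFIX = "org.apache.kafka."   # assumed telemetry namespace prefix
GROUP_SUFFIX = "-metrics"      # assumed suffix dropped from the group name


def telemetry_metric_name(group: str, name: str) -> str:
    """Map a client metric (group, name) to a dotted telemetry-style name."""
    if group.endswith(GROUP_SUFFIX):
        group = group[: -len(GROUP_SUFFIX)]
    # Lower-case each part and turn hyphens into dots.
    parts = [part.lower().replace("-", ".") for part in (group, name)]
    return PREFIX + ".".join(parts)


example = telemetry_metric_name(
    "consumer-coordinator-metrics", "assigned-partitions"
)
```

Under these assumptions, a subscription could then refer to the metric by
a prefix of the resulting dotted name.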
>
> I am interested in the idea of the client’s leader epoch in this context,
> but I don’t have an immediate plan for how best to do this, and it would
> take another KIP to enhance existing metrics or introduce some new ones.
> Those would then naturally be applicable to the metrics push introduced in
> KIP-714.
>
> In a similar vein, there are no existing client metrics specifically for
> auto-commit. We could add them to Kafka, but I really think this is just
> an example of asynchronous commit in which the application has decided not
> to specify when the commit should begin.
>
> It is possible to increase the cadence of pushing by modifying the
> interval.ms configuration property of the CLIENT_METRICS resource.
>
> There is an “assigned-partitions” metric for each consumer, but not one for
> active partitions. We could add one, again as a follow-on KIP.
>
> I take your point about holding on to a connection in a channel which might
> experience congestion. Do you have a suggestion for how to improve on this?
> For example, the client does have the concept of a least-loaded node. Maybe
> this is something we should investigate in the implementation and decide
> on the best approach. In general, I think sticking with the same node for
> consecutive pushes is best, but if you choose the “wrong” node to start
> with, it’s not ideal.
>
> Thanks,
> Andrew
>
> > On 8 Sep 2023, at 19:29, Philip Nee <[email protected]> wrote:
> >
> > Hey Andrew -
> >
> > +1 but I don't have a binding vote!
> >
> > It took me a while to go through the KIP. Here are some of my notes
> > during the reading:
> >
> > *Metrics*
> > - Should we care about the client's leader epoch? There is a case where
> > the user recreates the topic, but the consumer thinks it is still the same
> > topic and therefore attempts to start from an offset that doesn't exist.
> > KIP-848 addresses this issue, but I can still see some potential benefits
> > from knowing the client's epoch information.
> > - I assume poll idle is similar to poll interval: I needed to read the
> > description a few times.
> > - I don't have a clear use case in mind for the commit latency, but I do
> > think sometimes people lack clarity about how much progress was tracked
> > by the auto-commit. Would tracking auto-commit-related metrics be useful?
> > I was thinking: the last offset committed or the actual cadence in ms.
> > - Are there cases when we need to increase the cadence of telemetry data
> > push? i.e. variable interval.
> > - Thanks for implementing the randomized initial metric push; I think it
> > is really important.
> > - Is there a potential use case for tracking the number of active
> > partitions? The consumer can pause partitions via API, during revocation,
> > or during offset reset for the stream.
> >
> > *Connections*:
> > - The KIP stated that it will keep the same connection until the
> > connection is disconnected. I wonder if that could potentially cause
> > congestion if it is already a busy channel, which leads to connection
> > timeout and subsequently disconnection.
> >
> > Thanks,
> > P
> >
> > On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
> > [email protected]> wrote:
> >
> >> Bumping the voting thread for KIP-714.
> >>
> >> So far, we have:
> >> Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 4 Aug 2023, at 09:45, Andrew Schofield <[email protected]> wrote:
> >>>
> >>> Hi,
> >>> After almost 2 1/2 years in the making, I would like to call a vote for
> >>> KIP-714 (
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>> ).
> >>>
> >>> This KIP aims to improve monitoring and troubleshooting of client
> >>> performance by enabling clients to push metrics to brokers.
> >>>
> >>> I’d like to thank everyone that participated in the discussion,
> >>> especially the librdkafka team since one of the aims of the KIP is to
> >>> enable any client to participate, not just the Apache Kafka project’s
> >>> Java clients.
> >>>
> >>> Thanks,
> >>> Andrew
