Re: [DISCUSS] KIP-714: Client metrics and observability

Andrew Schofield Tue, 03 Oct 2023 09:49:33 -0700

Hi David,
Thanks for your interest in KIP-714.

Because this KIP is under development at the same time as KIP-848, it will
need to support both the existing KafkaConsumer code and the refactored code
being worked on under KIP-848. I’ve updated the Threading section accordingly.


Thanks,
Andrew

> On 30 Sep 2023, at 01:45, David Jacot <dja...@confluent.io.INVALID> wrote:
>
> Hi Andrew,
>
> Thanks for driving this one. I haven't read all the KIP yet but I already
> have an initial question. In the Threading section, it is written
> "KafkaConsumer: the "background" thread (based on the consumer threading
> refactor which is underway)". If I understand this correctly, it means
> that KIP-714 won't work if the "old consumer" is used. Am I correct?
>
> Cheers,
> David
>
>
> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Philip,
>> No, I do not think it should actively search for a broker that supports
>> the new
>> RPCs. In general, either all of the brokers or none of the brokers will
>> support it.
>> In the window, where the cluster is being upgraded or client telemetry is
>> being
>> enabled, there might be a mixed situation. I wouldn’t put too much effort
>> into
>> this mixed scenario. As the client finds brokers which support the new
>> RPCs,
>> it can begin to follow the KIP-714 mechanism.
>>
>> Thanks,
>> Andrew
>>
>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote:
>>>
>>> Hi Andrew -
>>>
>>> Question on top of your answers: Do you think the client should actively
>>> search for a broker that supports this RPC? As previously mentioned, the
>>> broker uses the leastLoadedNode to find its first connection (am
>>> I correct?), and what if that broker doesn't support the metric push?
>>>
>>> P
>>>
>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>> andrew_schofield_j...@outlook.com> wrote:
>>>
>>>> Hi Kirk,
>>>> Thanks for your question. You are correct that the presence or absence
>> of
>>>> the new RPCs in the
>>>> ApiVersionsResponse tells the client whether to request the telemetry
>>>> subscriptions and push
>>>> metrics.
>>>>
>>>> This is of course tricky in practice. It would be conceivable, as a
>>>> cluster is upgraded to AK 3.7
>>>> or as a client metrics receiver plugin is deployed across the cluster,
>>>> that a client connects to some
>>>> brokers that support the new RPCs and some that do not.
>>>>
>>>> Here’s my suggestion:
>>>> * If a client is not connected to any brokers that support in the new
>>>> RPCs, it cannot push metrics.
>>>> * If a client is only connected to brokers that support the new RPCs, it
>>>> will use the new RPCs in
>>>> accordance with the KIP.
>>>> * If a client is connected to some brokers that support the new RPCs and
>>>> some that do not, it will
>>>> use the new RPCs with the supporting subset of brokers in accordance
>> with
>>>> the KIP.
>>>>
>>>> Comments?
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:
>>>>>
>>>>> Hi Andrew/Jun,
>>>>>
>>>>> I want to make sure I understand question/comment #119… In the case
>>>> where a cluster without a metrics client receiver is later reconfigured
>> and
>>>> restarted to include a metrics client receiver, do we want the client to
>>>> thereafter begin pushing metrics to the cluster? From Andrew’s response
>> to
>>>> question #119, it sounds like we’re using the presence/absence of the
>>>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
>>>> indicator. Do I have that correct?
>>>>>
>>>>> Thanks,
>>>>> Kirk
>>>>>
>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>
>>>>>> Hi Jun,
>>>>>> Thanks for your comments. I’ve updated the KIP to clarify where
>>>> necessary.
>>>>>>
>>>>>> 110. Yes, agree. The motivation section mentions this.
>>>>>>
>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
>>>> replacement of
>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
>>>> think it’s a bit
>>>>>> of a debatable point. OTLP makes a distinction between a namespace
>> and a
>>>>>> multi-word component. If it was “client.id” then “client” would be a
>>>> namespace with
>>>>>> an attribute key “id”. But “client_id” is just a key. So, it was
>>>> intentional, but debatable.
>>>>>>
>>>>>> 112. Thanks. The link target moved. Fixed.
>>>>>>
>>>>>> 113. Thanks. Fixed.
>>>>>>
>>>>>> 114.1. If a standard metric makes sense for a client, it should use
>> the
>>>> exact same
>>>>>> name. If a standard metric doesn’t make sense for a client, then it
>> can
>>>> omit that metric.
>>>>>>
>>>>>> For a required metric, the situation is stronger. All clients must
>>>> implement these
>>>>>> metrics with these names in order to implement the KIP. But the
>>>> required metrics
>>>>>> are essentially the number of connections and the request latency,
>>>> which do not
>>>>>> reference the underlying implementation of the client (which
>>>> producer.record.queue.time.max
>>>>>> of course does).
>>>>>>
>>>>>> I suppose someone might build a producer-only client that didn’t have
>>>> consumer metrics.
>>>>>> In this case, the consumer metrics would conceptually have the value 0
>>>> and would not
>>>>>> need to be sent to the broker.
>>>>>>
>>>>>> 114.2. If a client does not implement some metrics, they will not be
>>>> available for
>>>>>> analysis and troubleshooting. It just makes the ability to combine
>>>> metrics from lots
>>>>>> different clients less complete.
>>>>>>
>>>>>> 115. I think it was probably a mistake to be so specific about
>>>> threading in this KIP.
>>>>>> When the consumer threading refactor is complete, of course, it would
>>>> do the appropriate
>>>>>> equivalent. I’ve added a clarification and massively simplified this
>>>> section.
>>>>>>
>>>>>> 116. I removed “client.terminating”.
>>>>>>
>>>>>> 117. Yes. Horrid. Fixed.
>>>>>>
>>>>>> 118. The Terminating flag just indicates that this is the final
>>>> PushTelemetryRequest
>>>>>> from this client. Any subsequent request will be rejected. I think
>> this
>>>> flag should remain.
>>>>>>
>>>>>> 119. Good catch. This was actually contradicting another part of the
>>>> KIP. The current behaviour
>>>>>> is indeed preserved. If the broker doesn’t have a client metrics
>>>> receiver plugin, the new RPCs
>>>>>> in this KIP are “turned off” and not reported in ApiVersionsResponse.
>>>> The client will not
>>>>>> attempt to push metrics.
>>>>>>
>>>>>> 120. The error handling table lists the error codes for
>>>> PushTelemetryResponse. I’ve added one
>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
>>>> error codes, since the
>>>>>> situation in which the client telemetry is not supported is handled by
>>>> the RPCs not being offered
>>>>>> by the broker.
>>>>>>
>>>>>> 121. Again, I think it’s probably a mistake to be specific about
>>>> threading. Removed.
>>>>>>
>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation should be
>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
>>>>>> resource.
>>>>>>
>>>>>> Thanks for the detailed review.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>>
>>>>>>> 110. Another potential motivation is the multiple clients support.
>>>> Some of
>>>>>>> the places may not have good monitoring support for non-java clients.
>>>>>>>
>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name
>> and
>>>>>>> replace '-' with '_' for attributes. Why is the inconsistency?
>>>>>>>
>>>>>>> 112. OTLP specification: Page is not found from the link.
>>>>>>>
>>>>>>> 113. "Defining standard and required metrics makes the monitoring and
>>>>>>> troubleshooting of clients from various client types ": Incomplete
>>>> sentence.
>>>>>>>
>>>>>>> 114. standard/required metrics
>>>>>>> 114.1 Do other clients need to implement those metrics with the exact
>>>> same
>>>>>>> names?
>>>>>>> 114.2 What happens if some of those metrics are missing from a
>> client?
>>>>>>>
>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application threads":
>> We
>>>>>>> have an ongoing effort to refactor the consumer threading model (
>>>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
>>>> ).
>>>>>>> Once this is done, PRC requests will only be made from the background
>>>>>>> thread. Should this KIP follow the new model only?
>>>>>>>
>>>>>>> 116. 'The metrics should contain the reason for the client
>> termination
>>>> by
>>>>>>> including the client.terminating metric with the label “reason” ...'.
>>>> Hmm,
>>>>>>> are we introducing a new metric client.terminating? If so, that needs
>>>> to be
>>>>>>> explicitly listed.
>>>>>>>
>>>>>>> 117. "As the metrics plugin may need to add additional metrics on top
>>>> of
>>>>>>> this the generic metrics receiver in the broker will not add these
>>>> labels
>>>>>>> but rely on the plugins to do so," The sentence doesn't read well.
>>>>>>>
>>>>>>> 118. "it is possible for the client to send at most one accepted
>>>>>>> out-of-profile per connection before the rate-limiter kicks in": If
>> we
>>>> do
>>>>>>> this, do we still need the Terminating flag in
>> PushTelemetryRequestV0?
>>>>>>>
>>>>>>> 119. "If there is no client metrics receiver plugin configured on the
>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with
>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client
>> should
>>>>>>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs
>>>> has
>>>>>>> expired. This allows the metrics receiver to be enabled or disabled
>>>> without
>>>>>>> having to restart the broker or reset the client connection."
>>>>>>> "no client metrics receiver plugin configured" is defined by no
>> metric
>>>>>>> reporter implementing the ClientTelemetry interface, right? In that
>>>> case,
>>>>>>> it would be useful to avoid the clients sending
>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve the current
>>>>>>> behavior.
>>>>>>>
>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0:
>>>> Could
>>>>>>> we list error codes for each?
>>>>>>>
>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may
>>>> be
>>>>>>> called from the request handling thread": Where else can this method
>> be
>>>>>>> called?
>>>>>>>
>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the
>>>> ACL?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jun
>>>>>>>
>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>
>>>>>>>> Hi Milind,
>>>>>>>> Thanks for your question.
>>>>>>>>
>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely to be
>>>> caused by a
>>>>>>>> problem in the serialization in the client. I have changed the
>> client
>>>>>>>> action in this case
>>>>>>>> to “Log an error and stop pushing metrics”.
>>>>>>>>
>>>>>>>> I have updated the KIP text accordingly.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
>>>> <milut...@confluent.io.INVALID>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>> Thanks for the clarifications.
>>>>>>>>>
>>>>>>>>> About 2b:
>>>>>>>>> In case a client has a bug while serializing, it might be difficult
>>>> for
>>>>>>>> the
>>>>>>>>> client to recover from that without code changes. In that, it might
>>>> be
>>>>>>>> good
>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the error as
>>>> fatal
>>>>>>>>> for the client (only fatal in terms of sending the metrics, the
>>>> client
>>>>>>>> can
>>>>>>>>> keep functioning otherwise). What do you think?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Milind
>>>>>>>>>
>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Milind,
>>>>>>>>>> Thanks for your questions about the KIP.
>>>>>>>>>>
>>>>>>>>>> 1) I did some archaeology and looked at historical versions of the
>>>> KIP.
>>>>>>>> I
>>>>>>>>>> think this is
>>>>>>>>>> just a mistake. 5 minutes is the default metric push interval. 30
>>>>>>>> minutes
>>>>>>>>>> is a mystery
>>>>>>>>>> to me. I’ve updated the KIP.
>>>>>>>>>>
>>>>>>>>>> 2) I think there are two situations in which INVALID_RECORD might
>>>> occur.
>>>>>>>>>> a) The client might perhaps be using a content-type that the
>> broker
>>>> does
>>>>>>>>>> not support.
>>>>>>>>>> The KIP mentions content-type as a future extension, but there’s
>>>> only
>>>>>>>> one
>>>>>>>>>> supported
>>>>>>>>>> to start with. Until we have multiple content-types, this seems
>> out
>>>> of
>>>>>>>>>> scope. I think a
>>>>>>>>>> future KIP would add another error code for this.
>>>>>>>>>> b) The client might perhaps have a bug which means the metrics
>>>> payload
>>>>>>>> is
>>>>>>>>>> malformed.
>>>>>>>>>> Logging a warning and attempting the next metrics push on the push
>>>>>>>>>> interval seems
>>>>>>>>>> appropriate.
>>>>>>>>>>
>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an
>>>> immediate
>>>>>>>>>> GetTelemetrySubscriptionsRequest.
>>>>>>>>>>
>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or
>>>> perhaps
>>>>>>>>>> a situation in which a broker sends a compression type in a
>>>>>>>>>> GetTelemetrySubscriptionsResponse
>>>>>>>>>> which is subsequently not supported when its used with a
>>>>>>>>>> PushTelemetryRequest.
>>>>>>>>>> We do want the client to have the opportunity to get an up-to-date
>>>> list
>>>>>>>> of
>>>>>>>>>> supported
>>>>>>>>>> compression types. I think an immediate
>>>> GetTelemetrySubscriptionsRequest
>>>>>>>>>> is appropriate.
>>>>>>>>>>
>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null
>>>>>>>>>> ClientInstanceId, the
>>>>>>>>>> receiving broker may not already know the client's existing
>>>>>>>>>> ClientInstanceId. If the
>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it simply
>>>> responds
>>>>>>>>>> the existing
>>>>>>>>>> value back to the client. If it does not know the existing
>>>>>>>>>> ClientInstanceId, it will create
>>>>>>>>>> a new client instance ID and respond with that.
>>>>>>>>>>
>>>>>>>>>> I will update the KIP with these clarifications.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
>>>> <milut...@confluent.io.INVALID
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew, thanks for this KIP.
>>>>>>>>>>>
>>>>>>>>>>> I had a few questions regarding the "Error handling" section.
>>>>>>>>>>>
>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to
>> eventually
>>>>>>>>>> trigger
>>>>>>>>>>> a retry and avoid having to restart clients if the cluster
>> metrics
>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator error,
>>>> rolling
>>>>>>>>>>> upgrades, etc."
>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is
>> it
>>>>>>>>>>> referring to?
>>>>>>>>>>>
>>>>>>>>>>> 2. For the actual errors:
>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to the
>>>>>>>>>>> application and schedule the next
>> GetTelemetrySubscriptionsRequest
>>>> to 5
>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like
>>>> PushIntervalMs?
>>>>>>>>>> And
>>>>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in
>>>> this
>>>>>>>>>>> case, if the serialization is broken?
>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
>>>>>>>> confirm,
>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled
>>>> immediately
>>>>>>>>>>> after the PushTelemetry response, correct?
>>>>>>>>>>>
>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include
>>>> the
>>>>>>>>>>> ClientInstanceId returned in the first response, regardless of
>>>> broker":
>>>>>>>>>>> Will a broker error be returned in case some implementation of
>>>> this KIP
>>>>>>>>>>> violates this accidentally and sends a request with
>>>> ClientInstanceId =
>>>>>>>>>> Null
>>>>>>>>>>> even when it's been obtained already? Or will a new
>>>> ClientInstanceId be
>>>>>>>>>>> returned without an error?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I would like to start a new discussion thread on KIP-714: Client
>>>>>>>> metrics
>>>>>>>>>>>> and observability.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>>>>>>>>>
>>>>>>>>>>>> I have edited the proposal significantly to reduce the scope.
>> The
>>>>>>>>>> overall
>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, but the
>>>>>>>>>>>> KIP is now based on the existing client metrics, rather than
>>>>>>>> introducing
>>>>>>>>>>>> new metrics. The purpose remains helping cluster operators
>>>>>>>>>>>> investigate performance problems experienced by clients without
>>>>>>>>>> requiring
>>>>>>>>>>>> changes to the client application code or configuration.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Andrew
>>
>>
>>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to