Re: [DISCUSS] KIP-714: Client metrics and observability

Jun Rao Thu, 12 Oct 2023 10:08:15 -0700

Hi, Matthias,

130. Yes, throwing an exception sounds reasonable. It would be useful to
document this.


131. I was thinking that we could just return all consumers (including the
global consumer) through Map<String, String> consumerInstanceIds() and use
keys to identify each consumer instance. The benefit is that the
implementation (whether to use a separate global consumer or not) could
change in the future, but the API can remain the same. Another slight
benefit is that there is no need for returning Optional<String>. If the
global consumer is not used, it just won't be included in the map.

Thanks,

Jun


On Thu, Oct 12, 2023 at 9:30 AM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks Sophie and Jun.
>
> `clientInstanceIds()` is fine with me -- was not sure about the double
> plural myself.
>
> Sorry if my comments was confusing. I was trying to say, that adding a
> overload to `KafkaStreams` that does not take a timeout parameter does
> not make sense, because there is no `default.api.timeout.ms` config for
> Kafka Streams, so users always need to pass in a timeout. (Same for
> producer.)
>
> For the implementation, I think KS would always call
> `client.clientInstanceId(timeout)` and never rely on
> `default.api.timeout.ms` though, so we can stay in control -- if a
> timeout is passed by the user, it would always overwrite
> `default.api.timeout.ms` on the consumer/admin and thus we should follow
> the same semantics in Kafka Streams, and overwrite it explicitly when
> calling `client.clientInstanceId()`.
>
> The proposed API also makes sense to me. I was just wondering if we want
> to extend it for client users -- for KS we won't need/use the
> timeout-less overloads.
>
>
>
> 130) My intent was to throw a TimeoutException if we cannot get all
> instanceIds, because it's the standard contract for timeouts. It would
> also be hard to tell for a user, if a full or partial result was
> returned (or we add a method `boolean isPartialResult()` to make it
> easier for users).
>
> If there is concerns/objections, I am also ok to return a partial result
> -- it would require a change to the newly added `ClientInstanceIds`
> return type -- for `adminInstanceId` we only return a `String` right now
> -- we might need to change this to `Optional<String>` so we are able to
> return a partial result?
>
>
> 131) Of course we could, but I am not sure what we would gain? In the
> end, implementation details would always leak because if we change the
> number of consumer we use, we would return different keys in the `Map`.
> Atm, the proposal implies that the same key might be used for the "main"
> and "restore" consumer of the same thread -- but we can make keys unique
> by adding a `-restore` suffix to the restore-consumer key if we merge
> both maps. -- Curious to hear what others think. I am very open to do it
> differently than currently proposed.
>
>
> -Matthias
>
>
> On 10/12/23 8:39 AM, Jun Rao wrote:
> > Hi, Matthias,
> >
> > Thanks for the reply.
> >
> > 130. What would be the semantic? If the timeout has expired and only some
> > of the client instances' id have been retrieved, does the call return the
> > partial result or throw an exception?
> >
> > 131. Could we group all consumer instances in a single method since we
> are
> > returning the key for each instance already? This probably also avoids
> > exposing implementation details that could change over time.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Oct 12, 2023 at 12:00 AM Sophie Blee-Goldman <
> sop...@responsive.dev>
> > wrote:
> >
> >> Regarding the naming, I personally think `clientInstanceId` makes sense
> for
> >> the plain clients
> >>   -- especially if we might later introduce the notion of an
> >> `applicationInstanceId`.
> >>
> >> I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API,
> >> though, can we use
> >> `clientInstanceIds` instead? (The difference being the placement of the
> >> plural 's')
> >> I would similarly rename the class to just ClientInstanceIds
> >>
> >> we can also not have a timeout-less overload,  because `KafkaStreams`
> does
> >>> not have a `default.api.timeout.ms` config either
> >>
> >> With respect to the timeout for the Kafka Streams API, I'm a bit
> confused
> >> by the
> >> doubletriple-negative of Matthias' comment here, but I was thinking
> about
> >> this
> >> earlier and this was my take: with the current proposal, we would allow
> >> users to pass
> >> in an absolute timeout as a parameter that would apply to the method as
> a
> >> whole.
> >> Meanwhile within the method we would issue separate calls to each of the
> >> clients using
> >> the default or user-configured value of their  `default.api.timeout.ms`
> as
> >> the timeout
> >> parameter.
> >>
> >> So the API as proposed makes sense to me.
> >>
> >>
> >> On Wed, Oct 11, 2023 at 6:48 PM Matthias J. Sax <mj...@apache.org>
> wrote:
> >>
> >>> In can answer 130 and 131.
> >>>
> >>> 130) We cannot guarantee that all clients are already initialized due
> to
> >>> race conditions. We plan to not allow calling
> >>> `KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or
> >>> REBALANCING) though -- guess this slipped on the KIP and should be
> >>> added? But because StreamThreads can be added dynamically (and producer
> >>> might be created dynamically at runtime; cf below), we still cannot
> >>> guarantee that all clients are already initialized when the method is
> >>> called. Of course, we assume that all clients are most likely
> initialize
> >>> on the happy path, and blocking calls to `client.clientInstanceId()`
> >>> should be rare.
> >>>
> >>> To address the worst case, we won't do a naive implementation and just
> >>> loop over all clients, but fan-out the call to the different
> >>> StreamThreads (and GlobalStreamThread if it exists), and use Futures to
> >>> gather the results.
> >>>
> >>> Currently, `StreamThreads` has 3 clients (if ALOS or EOSv2 is used), so
> >>> we might do 3 blocking calls in the worst case (for EOSv1 we get a
> >>> producer per tasks, and we might end up doing more blocking calls if
> the
> >>> producers are not initialized yet). Note that EOSv1 is already
> >>> deprecated, and we are also working on thread refactoring that will
> >>> reduce the number of client on StreamThread to 2 -- and we have more
> >>> refactoring planned to reduce the number of clients even further.
> >>>
> >>> Inside `KafakStreams#clientsInstanceIds()` we might only do single
> >>> blocking call for the admin client (ie, `admin.clientInstanceId()`).
> >>>
> >>> I agree that we need to do some clever timeout management, but it seems
> >>> to be more of an implementation detail?
> >>>
> >>> Do you have any particular concerns, or does the proposed
> implementation
> >>> as sketched above address your question?
> >>>
> >>>
> >>> 130) If the Topology does not have a global-state-store, there won't be
> >>> a GlobalThread and thus not global consumer. Thus, we return an
> Optional.
> >>>
> >>>
> >>>
> >>> On three related question for Andrew.
> >>>
> >>> (1) Why is the method called `clientInstanceId()` and not just plain
> >>> `instanceId()`?
> >>>
> >>> (2) Why so we return a `String` while but not a UUID type? The added
> >>> protocol request/response classes use UUIDs.
> >>>
> >>> (3) Would it make sense to have an overloaded `clientInstanceId()`
> >>> method that does not take any parameter but uses `default.api.timeout`
> >>> config (this config does no exist on the producer though, so we could
> >>> only have it for consumer and admin at this point). We could of course
> >>> also add overloads like this later if user request them (and/or add
> >>> `default.api.timeout.ms` to the producer, too).
> >>>
> >>> Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense
> as
> >>> a method name though, as `KafkaStreams` itself does not have an
> >>> `instanceId` -- we can also not have a timeout-less overload, because
> >>> `KafkaStreams` does not have a `default.api.timeout.ms` config either
> >>> (and I don't think it make sense to add).
> >>>
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 10/11/23 5:07 PM, Jun Rao wrote:
> >>>> Hi, Andrew,
> >>>>
> >>>> Thanks for the updated KIP. Just a few more minor comments.
> >>>>
> >>>> 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait
> for
> >>> all
> >>>> consumer/producer/adminClient instances to be initialized? Are all
> >> those
> >>>> instances created during KafkaStreams initialization?
> >>>>
> >>>> 131. Why does globalConsumerInstanceId() return Optional<String> while
> >>>> other consumer instances don't return Optional?
> >>>>
> >>>> 132. ClientMetricsSubscriptionRequestCount: Do we need this since we
> >>> have a
> >>>> set of generic metrics
> >>>> (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
> >>>> report Request rate for every request type?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jun
> >>>>
> >>>> On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org>
> >>> wrote:
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On 10/10/23 11:31 PM, Andrew Schofield wrote:
> >>>>>> Matthias,
> >>>>>> Yes, I think that’s a sensible way forward and the interface you
> >>> propose
> >>>>> looks good. I’ll update the KIP accordingly.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Andrew
> >>>>>>
> >>>>>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org>
> wrote:
> >>>>>>>
> >>>>>>> Andrew,
> >>>>>>>
> >>>>>>> yes I would like to get this change into KIP-714 right way. Seems
> to
> >>> be
> >>>>> important, as we don't know if/when a follow-up KIP for Kafka Streams
> >>> would
> >>>>> land.
> >>>>>>>
> >>>>>>> I was also thinking (and discussed with a few others) how to expose
> >>> it,
> >>>>> and we would propose the following:
> >>>>>>>
> >>>>>>> We add a new method to `KafkaStreams` class:
> >>>>>>>
> >>>>>>>       public ClientsInstanceIds clientsInstanceIds(Duration
> timeout);
> >>>>>>>
> >>>>>>> The returned object is like below:
> >>>>>>>
> >>>>>>>     public class ClientsInstanceIds {
> >>>>>>>       // we only have a single admin client per KS instance
> >>>>>>>       String adminInstanceId();
> >>>>>>>
> >>>>>>>       // we only have a single global consumer per KS instance (if
> >> any)
> >>>>>>>       // Optional<> because we might not have global-thread
> >>>>>>>       Optional<String> globalConsumerInstanceId();
> >>>>>>>
> >>>>>>>       // return a <threadKey -> ClientInstanceId> mapping
> >>>>>>>       // for the underlying (restore-)consumers/producers
> >>>>>>>       Map<String, String> mainConsumerInstanceIds();
> >>>>>>>       Map<String, String> restoreConsumerInstanceIds();
> >>>>>>>       Map<String, String> producerInstanceIds();
> >>>>>>> }
> >>>>>>>
> >>>>>>> For the `threadKey`, we would use some pattern like this:
> >>>>>>>
> >>>>>>>     [Stream|StateUpdater]Thread-<threadIdx>
> >>>>>>>
> >>>>>>>
> >>>>>>> Would this work from your POV?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> -Matthias
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
> >>>>>>>> Hi Matthias,
> >>>>>>>> Good point. Makes sense to me.
> >>>>>>>> Is this something that can also be included in the proposed Kafka
> >>>>> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
> >>>>>>>> I have a slight preference for the former to put all of the KS
> >>>>> enhancements into a separate KIP.
> >>>>>>>> Thanks,
> >>>>>>>> Andrew
> >>>>>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org>
> >> wrote:
> >>>>>>>>>
> >>>>>>>>> Thanks Andrew. SGTM.
> >>>>>>>>>
> >>>>>>>>> One point you did not address is the idea to add a method to
> >>>>> `KafkaStreams` similar to the proposed `clientInstanceId()` that will
> >> be
> >>>>> added to consumer/producer/admin clients.
> >>>>>>>>>
> >>>>>>>>> Without addressing this, Kafka Streams users won't have a way to
> >> get
> >>>>> the assigned `instanceId` of the internally created clients, and thus
> >> it
> >>>>> would be very difficult for them to know which metrics that the
> broker
> >>>>> receives belong to a Kafka Streams app. It seems they would only find
> >>> the
> >>>>> `instanceIds` in the log4j output if they enable client logging?
> >>>>>>>>>
> >>>>>>>>> Of course, because there is multiple clients inside Kafka
> Streams,
> >>>>> the return type cannot be an single "String", but must be some some
> >>> complex
> >>>>> data structure -- we could either add a new class, or return a
> >>>>> Map<String,String> using a client key that maps to the `instanceId`.
> >>>>>>>>>
> >>>>>>>>> For example we could use the following key:
> >>>>>>>>>
> >>>>>>>>>
> >>>   [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]
> >>>>>>>>>
> >>>>>>>>> (Of course, only the valid combination.)
> >>>>>>>>>
> >>>>>>>>> Or maybe even better, we might want to return a `Future` because
> >>>>> collection all the `instanceId` might be a blocking all on each
> >> client?
> >>> I
> >>>>> have already a few idea how it could be implemented but I don't think
> >> it
> >>>>> must be discussed on the KIP, as it's an implementation detail.
> >>>>>>>>>
> >>>>>>>>> Thoughts?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Matthias
> >>>>>>>>>
> >>>>>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
> >>>>>>>>>> Hi Matthias,
> >>>>>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka
> >>>>> Streams makes sense. This KIP currently has made a bit
> >>>>>>>>>> of an effort to embrace KS, but it’s not enough by a long way.
> >>>>>>>>>> I have removed `application.id <http://application.id/>`. This
> >>>>> should be done properly in the follow-up KIP. I don’t believe there’s
> >> a
> >>>>> downside to
> >>>>>>>>>> removing it from this KIP.
> >>>>>>>>>> I have reworded the statement about temporarily. In practice,
> the
> >>>>> implementation of this KIP that’s going on while the voting
> >>>>>>>>>> progresses happens to use delta temporality, but that’s an
> >>>>> implementation detail. Supporting clients must support both
> >>>>>>>>>> temporalities.
> >>>>>>>>>> I thought about exposing the client instance ID as a metric, but
> >>>>> non-numeric metrics are not usual practice and tools
> >>>>>>>>>> do not universally support them. I don’t think the KIP is
> >> improved
> >>>>> by adding one now.
> >>>>>>>>>> I have also added constants for the various Config classes for
> >>>>> ENABLE_METRICS_PUSH_CONFIG, including to
> >>>>>>>>>> StreamsConfig. It’s best to be explicit about this.
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Andrew
> >>>>>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org>
> >>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I did not pay attention to this KIP in the past; seems it was
> >>>>> on-hold for a while.
> >>>>>>>>>>>
> >>>>>>>>>>> Overall it sounds very useful, and I think we should extend
> this
> >>>>> with a follow up KIP for Kafka Streams. What is unclear to me at this
> >>> point
> >>>>> is the statement:
> >>>>>>>>>>>
> >>>>>>>>>>>> Kafka Streams applications have an application.id configured
> >> and
> >>>>> this identifier should be included as the application_id metrics
> >> label.
> >>>>>>>>>>>
> >>>>>>>>>>> The `application.id` is currently only used as the (main)
> >>>>> consumer's `group.id` (and is part of an auto-generated `client.id`
> >> if
> >>>>> the user does not set one).
> >>>>>>>>>>>
> >>>>>>>>>>> This comment related to:
> >>>>>>>>>>>
> >>>>>>>>>>>> The following labels should be added by the client as
> >> appropriate
> >>>>> before metrics are pushed.
> >>>>>>>>>>>
> >>>>>>>>>>> Given that Kafka Streams uses the consumer/producer/admin
> client
> >>> as
> >>>>> "black boxes", a client does at this point not know that it's part of
> >> a
> >>>>> Kafka Streams application, and thus, it won't be able to attach any
> >> such
> >>>>> label to the metrics it sends. (Also producer and admin don't even
> >> know
> >>> the
> >>>>> value of `application.id` -- only the (main) consumer, indirectly
> >> via `
> >>>>> group.id`, but also restore and global consumer don't know it,
> >> because
> >>>>> they don't have `group.id` set).
> >>>>>>>>>>>
> >>>>>>>>>>> While I am totally in favor of the proposal, I am wondering how
> >> we
> >>>>> intent to implement it in clean way? Or would we do ok to have some
> >>>>> internal client APIs that KS can use to "register" itself with the
> >>> client?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> While clients must support both temporalities, the broker will
> >>>>> initially only send
> >>> GetTelemetrySubscriptionsResponse.DeltaTemporality=True
> >>>>>>>>>>>
> >>>>>>>>>>> Not sure if I can follow. How make the decision about DELTA or
> >>>>> CUMULATIVE metrics? Should the broker side plugin not decide what
> >>> metrics
> >>>>> it what to receive in which form? So what does "initially" mean --
> the
> >>>>> broker won't ship with a default plugin implementation?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> The following method is added to the Producer, Consumer, and
> >>> Admin
> >>>>> client interfaces:
> >>>>>>>>>>>
> >>>>>>>>>>> Should we add anything to Kafka Streams to expose the
> underlying
> >>>>> clients' assigned client-instance-ids programmatically? I am also
> >>> wondering
> >>>>> if clients should report their assigned client-instance-ids as
> metrics
> >>>>> itself (for this case, Kafka Streams won't need to do anything,
> >> because
> >>> we
> >>>>> already expose all client metrics).
> >>>>>>>>>>>
> >>>>>>>>>>> If we add anything programmatic, we need to make it simple,
> >> given
> >>>>> that Kafka Streams has many clients per `StreamThread` and may have
> >>>>> multiple threads.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> enable.metrics.push
> >>>>>>>>>>> It might be worth to add this to `StreamsConfig`, too? It set
> >> via
> >>>>> StreamsConfig, we would forward it to all clients automatically.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -Matthias
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote:
> >>>>>>>>>>>> Hi Andrew,
> >>>>>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet
> >> but I
> >>>>> already
> >>>>>>>>>>>> have an initial question. In the Threading section, it is
> >> written
> >>>>>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer
> >>>>> threading
> >>>>>>>>>>>> refactor which is underway)". If I understand this correctly,
> >> it
> >>>>> means
> >>>>>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I
> >>>>> correct?
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> David
> >>>>>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
> >>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>> Hi Philip,
> >>>>>>>>>>>>> No, I do not think it should actively search for a broker
> that
> >>>>> supports
> >>>>>>>>>>>>> the new
> >>>>>>>>>>>>> RPCs. In general, either all of the brokers or none of the
> >>>>> brokers will
> >>>>>>>>>>>>> support it.
> >>>>>>>>>>>>> In the window, where the cluster is being upgraded or client
> >>>>> telemetry is
> >>>>>>>>>>>>> being
> >>>>>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too
> >>>>> much effort
> >>>>>>>>>>>>> into
> >>>>>>>>>>>>> this mixed scenario. As the client finds brokers which
> support
> >>>>> the new
> >>>>>>>>>>>>> RPCs,
> >>>>>>>>>>>>> it can begin to follow the KIP-714 mechanism.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Andrew -
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Question on top of your answers: Do you think the client
> >> should
> >>>>> actively
> >>>>>>>>>>>>>> search for a broker that supports this RPC? As previously
> >>>>> mentioned, the
> >>>>>>>>>>>>>> broker uses the leastLoadedNode to find its first connection
> >>> (am
> >>>>>>>>>>>>>> I correct?), and what if that broker doesn't support the
> >> metric
> >>>>> push?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> P
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> >>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Kirk,
> >>>>>>>>>>>>>>> Thanks for your question. You are correct that the presence
> >> or
> >>>>> absence
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> the new RPCs in the
> >>>>>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the
> >>>>> telemetry
> >>>>>>>>>>>>>>> subscriptions and push
> >>>>>>>>>>>>>>> metrics.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is of course tricky in practice. It would be
> >> conceivable,
> >>>>> as a
> >>>>>>>>>>>>>>> cluster is upgraded to AK 3.7
> >>>>>>>>>>>>>>> or as a client metrics receiver plugin is deployed across
> >> the
> >>>>> cluster,
> >>>>>>>>>>>>>>> that a client connects to some
> >>>>>>>>>>>>>>> brokers that support the new RPCs and some that do not.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Here’s my suggestion:
> >>>>>>>>>>>>>>> * If a client is not connected to any brokers that support
> >> in
> >>>>> the new
> >>>>>>>>>>>>>>> RPCs, it cannot push metrics.
> >>>>>>>>>>>>>>> * If a client is only connected to brokers that support the
> >>> new
> >>>>> RPCs, it
> >>>>>>>>>>>>>>> will use the new RPCs in
> >>>>>>>>>>>>>>> accordance with the KIP.
> >>>>>>>>>>>>>>> * If a client is connected to some brokers that support the
> >>> new
> >>>>> RPCs and
> >>>>>>>>>>>>>>> some that do not, it will
> >>>>>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in
> >>>>> accordance
> >>>>>>>>>>>>> with
> >>>>>>>>>>>>>>> the KIP.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Comments?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro>
> >>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Andrew/Jun,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I want to make sure I understand question/comment #119… In
> >>> the
> >>>>> case
> >>>>>>>>>>>>>>> where a cluster without a metrics client receiver is later
> >>>>> reconfigured
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> restarted to include a metrics client receiver, do we want
> >> the
> >>>>> client to
> >>>>>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From
> >> Andrew’s
> >>>>> response
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> question #119, it sounds like we’re using the
> >> presence/absence
> >>>>> of the
> >>>>>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the
> >>>>> to-push-or-not-to-push
> >>>>>>>>>>>>>>> indicator. Do I have that correct?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Kirk
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> >>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Jun,
> >>>>>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify
> >>>>> where
> >>>>>>>>>>>>>>> necessary.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and
> >>> the
> >>>>>>>>>>>>>>> replacement of
> >>>>>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP
> >>>>> guidelines. I
> >>>>>>>>>>>>>>> think it’s a bit
> >>>>>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a
> >>>>> namespace
> >>>>>>>>>>>>> and a
> >>>>>>>>>>>>>>>>> multi-word component. If it was “client.id” then
> “client”
> >>>>> would be a
> >>>>>>>>>>>>>>> namespace with
> >>>>>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So,
> >> it
> >>>>> was
> >>>>>>>>>>>>>>> intentional, but debatable.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 113. Thanks. Fixed.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it
> >>>>> should use
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> exact same
> >>>>>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a
> >> client,
> >>>>> then it
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>> omit that metric.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For a required metric, the situation is stronger. All
> >>> clients
> >>>>> must
> >>>>>>>>>>>>>>> implement these
> >>>>>>>>>>>>>>>>> metrics with these names in order to implement the KIP.
> >> But
> >>>>> the
> >>>>>>>>>>>>>>> required metrics
> >>>>>>>>>>>>>>>>> are essentially the number of connections and the request
> >>>>> latency,
> >>>>>>>>>>>>>>> which do not
> >>>>>>>>>>>>>>>>> reference the underlying implementation of the client
> >> (which
> >>>>>>>>>>>>>>> producer.record.queue.time.max
> >>>>>>>>>>>>>>>>> of course does).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I suppose someone might build a producer-only client that
> >>>>> didn’t have
> >>>>>>>>>>>>>>> consumer metrics.
> >>>>>>>>>>>>>>>>> In this case, the consumer metrics would conceptually
> have
> >>>>> the value 0
> >>>>>>>>>>>>>>> and would not
> >>>>>>>>>>>>>>>>> need to be sent to the broker.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they
> >>> will
> >>>>> not be
> >>>>>>>>>>>>>>> available for
> >>>>>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability
> to
> >>>>> combine
> >>>>>>>>>>>>>>> metrics from lots
> >>>>>>>>>>>>>>>>> different clients less complete.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific
> >>> about
> >>>>>>>>>>>>>>> threading in this KIP.
> >>>>>>>>>>>>>>>>> When the consumer threading refactor is complete, of
> >> course,
> >>>>> it would
> >>>>>>>>>>>>>>> do the appropriate
> >>>>>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively
> >>>>> simplified this
> >>>>>>>>>>>>>>> section.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 116. I removed “client.terminating”.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 117. Yes. Horrid. Fixed.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the
> >>>>> final
> >>>>>>>>>>>>>>> PushTelemetryRequest
> >>>>>>>>>>>>>>>>> from this client. Any subsequent request will be
> >> rejected. I
> >>>>> think
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>>> flag should remain.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another
> >>> part
> >>>>> of the
> >>>>>>>>>>>>>>> KIP. The current behaviour
> >>>>>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client
> >>>>> metrics
> >>>>>>>>>>>>>>> receiver plugin, the new RPCs
> >>>>>>>>>>>>>>>>> in this KIP are “turned off” and not reported in
> >>>>> ApiVersionsResponse.
> >>>>>>>>>>>>>>> The client will not
> >>>>>>>>>>>>>>>>> attempt to push metrics.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 120. The error handling table lists the error codes for
> >>>>>>>>>>>>>>> PushTelemetryResponse. I’ve added one
> >>>>>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions
> >> doesn’t
> >>>>> have any
> >>>>>>>>>>>>>>> error codes, since the
> >>>>>>>>>>>>>>>>> situation in which the client telemetry is not supported
> >> is
> >>>>> handled by
> >>>>>>>>>>>>>>> the RPCs not being offered
> >>>>>>>>>>>>>>>>> by the broker.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be
> specific
> >>>>> about
> >>>>>>>>>>>>>>> threading. Removed.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation
> >>>>> should be
> >>>>>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
> >>>>> should be
> >>>>>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on
> >> the
> >>>>> CLUSTER
> >>>>>>>>>>>>>>>>> resource.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the detailed review.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 110. Another potential motivation is the multiple
> clients
> >>>>> support.
> >>>>>>>>>>>>>>> Some of
> >>>>>>>>>>>>>>>>>> the places may not have good monitoring support for
> >>> non-java
> >>>>> clients.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for
> >>>>> metric name
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the
> >>>>> inconsistency?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the
> link.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the
> >>>>> monitoring and
> >>>>>>>>>>>>>>>>>> troubleshooting of clients from various client types ":
> >>>>> Incomplete
> >>>>>>>>>>>>>>> sentence.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 114. standard/required metrics
> >>>>>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics
> >> with
> >>>>> the exact
> >>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>> names?
> >>>>>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing
> >>> from
> >>>>> a
> >>>>>>>>>>>>> client?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and
> >> application
> >>>>> threads":
> >>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer
> threading
> >>>>> model (
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
> >>>>>>>>>>>>>>> ).
> >>>>>>>>>>>>>>>>>> Once this is done, PRC requests will only be made from
> >> the
> >>>>> background
> >>>>>>>>>>>>>>>>>> thread. Should this KIP follow the new model only?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the
> >> client
> >>>>>>>>>>>>> termination
> >>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> including the client.terminating metric with the label
> >>>>> “reason” ...'.
> >>>>>>>>>>>>>>> Hmm,
> >>>>>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If
> >> so,
> >>>>> that needs
> >>>>>>>>>>>>>>> to be
> >>>>>>>>>>>>>>>>>> explicitly listed.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional
> >>>>> metrics on top
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not
> >>> add
> >>>>> these
> >>>>>>>>>>>>>>> labels
> >>>>>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't
> >>> read
> >>>>> well.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one
> >>>>> accepted
> >>>>>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter
> >> kicks
> >>>>> in": If
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>>> this, do we still need the Terminating flag in
> >>>>>>>>>>>>> PushTelemetryRequestV0?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin
> >>>>> configured on the
> >>>>>>>>>>>>>>>>>> broker, it will respond to
> >> GetTelemetrySubscriptionsRequest
> >>>>> with
> >>>>>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId.
> The
> >>>>> client
> >>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the
> >>>>> PushIntervalMs
> >>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled
> >> or
> >>>>> disabled
> >>>>>>>>>>>>>>> without
> >>>>>>>>>>>>>>>>>> having to restart the broker or reset the client
> >>> connection."
> >>>>>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is
> defined
> >>> by
> >>>>> no
> >>>>>>>>>>>>> metric
> >>>>>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface,
> >> right?
> >>>>> In that
> >>>>>>>>>>>>>>> case,
> >>>>>>>>>>>>>>>>>> it would be useful to avoid the clients sending
> >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to
> preserve
> >>>>> the current
> >>>>>>>>>>>>>>>>>> behavior.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and
> >>>>> PushTelemetryRequestV0:
> >>>>>>>>>>>>>>> Could
> >>>>>>>>>>>>>>>>>> we list error codes for each?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver
> >> This
> >>>>> method may
> >>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> called from the request handling thread": Where else can
> >>>>> this method
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> called?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we
> >>>>> changing the
> >>>>>>>>>>>>>>> ACL?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Jun
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
> >>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Milind,
> >>>>>>>>>>>>>>>>>>> Thanks for your question.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most
> >> likely
> >>>>> to be
> >>>>>>>>>>>>>>> caused by a
> >>>>>>>>>>>>>>>>>>> problem in the serialization in the client. I have
> >> changed
> >>>>> the
> >>>>>>>>>>>>> client
> >>>>>>>>>>>>>>>>>>> action in this case
> >>>>>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I have updated the KIP text accordingly.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra
> >>>>>>>>>>>>>>> <milut...@confluent.io.INVALID>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Andrew,
> >>>>>>>>>>>>>>>>>>>> Thanks for the clarifications.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> About 2b:
> >>>>>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might
> >> be
> >>>>> difficult
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> client to recover from that without code changes. In
> >>> that,
> >>>>> it might
> >>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat
> >> the
> >>>>> error as
> >>>>>>>>>>>>>>> fatal
> >>>>>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the
> >>>>> metrics, the
> >>>>>>>>>>>>>>> client
> >>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>> Milind
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> >>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Milind,
> >>>>>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical
> >>>>> versions of the
> >>>>>>>>>>>>>>> KIP.
> >>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>> think this is
> >>>>>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push
> >>>>> interval. 30
> >>>>>>>>>>>>>>>>>>> minutes
> >>>>>>>>>>>>>>>>>>>>> is a mystery
> >>>>>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2) I think there are two situations in which
> >>>>> INVALID_RECORD might
> >>>>>>>>>>>>>>> occur.
> >>>>>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type
> >> that
> >>>>> the
> >>>>>>>>>>>>> broker
> >>>>>>>>>>>>>>> does
> >>>>>>>>>>>>>>>>>>>>> not support.
> >>>>>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension,
> >> but
> >>>>> there’s
> >>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types,
> >>> this
> >>>>> seems
> >>>>>>>>>>>>> out
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> scope. I think a
> >>>>>>>>>>>>>>>>>>>>> future KIP would add another error code for this.
> >>>>>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means
> the
> >>>>> metrics
> >>>>>>>>>>>>>>> payload
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> malformed.
> >>>>>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics
> push
> >>> on
> >>>>> the push
> >>>>>>>>>>>>>>>>>>>>> interval seems
> >>>>>>>>>>>>>>>>>>>>> appropriate.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by
> >>> making
> >>>>> an
> >>>>>>>>>>>>>>> immediate
> >>>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a
> >> client
> >>>>> bug or
> >>>>>>>>>>>>>>> perhaps
> >>>>>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression
> type
> >>> in
> >>>>> a
> >>>>>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse
> >>>>>>>>>>>>>>>>>>>>> which is subsequently not supported when its used
> >> with a
> >>>>>>>>>>>>>>>>>>>>> PushTelemetryRequest.
> >>>>>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get
> >> an
> >>>>> up-to-date
> >>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>>>>>>> compression types. I think an immediate
> >>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
> >>>>>>>>>>>>>>>>>>>>> is appropriate.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a
> >>> Null
> >>>>>>>>>>>>>>>>>>>>> ClientInstanceId, the
> >>>>>>>>>>>>>>>>>>>>> receiving broker may not already know the client's
> >>>>> existing
> >>>>>>>>>>>>>>>>>>>>> ClientInstanceId. If the
> >>>>>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId,
> >> it
> >>>>> simply
> >>>>>>>>>>>>>>> responds
> >>>>>>>>>>>>>>>>>>>>> the existing
> >>>>>>>>>>>>>>>>>>>>> value back to the client. If it does not know the
> >>> existing
> >>>>>>>>>>>>>>>>>>>>> ClientInstanceId, it will create
> >>>>>>>>>>>>>>>>>>>>> a new client instance ID and respond with that.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra
> >>>>>>>>>>>>>>> <milut...@confluent.io.INVALID
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling"
> >>>>> section.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are
> >> to
> >>>>>>>>>>>>> eventually
> >>>>>>>>>>>>>>>>>>>>> trigger
> >>>>>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the
> >>>>> cluster
> >>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by
> >>> operator
> >>>>> error,
> >>>>>>>>>>>>>>> rolling
> >>>>>>>>>>>>>>>>>>>>>> upgrades, etc."
> >>>>>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere
> >> else.
> >>>>> What is
> >>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>> referring to?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 2. For the actual errors:
> >>>>>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a
> >>>>> warning to the
> >>>>>>>>>>>>>>>>>>>>>> application and schedule the next
> >>>>>>>>>>>>> GetTelemetrySubscriptionsRequest
> >>>>>>>>>>>>>>> to 5
> >>>>>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something
> >> like
> >>>>>>>>>>>>>>> PushIntervalMs?
> >>>>>>>>>>>>>>>>>>>>> And
> >>>>>>>>>>>>>>>>>>>>>> also, why are we scheduling a
> >>>>> GetTelemetrySubscriptionsRequest in
> >>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>> case, if the serialization is broken?
> >>>>>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID ,
> >> UNSUPPORTED_COMPRESSION_TYPE
> >>> :
> >>>>> just to
> >>>>>>>>>>>>>>>>>>> confirm,
> >>>>>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be
> >>>>> scheduled
> >>>>>>>>>>>>>>> immediately
> >>>>>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests
> >>>>> must include
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response,
> >>>>> regardless of
> >>>>>>>>>>>>>>> broker":
> >>>>>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some
> >>>>> implementation of
> >>>>>>>>>>>>>>> this KIP
> >>>>>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with
> >>>>>>>>>>>>>>> ClientInstanceId =
> >>>>>>>>>>>>>>>>>>>>> Null
> >>>>>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new
> >>>>>>>>>>>>>>> ClientInstanceId be
> >>>>>>>>>>>>>>>>>>>>>> returned without an error?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> >>>>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on
> >>>>> KIP-714: Client
> >>>>>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>>>>> and observability.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce
> >> the
> >>>>> scope.
> >>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>> overall
> >>>>>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is
> >>> unchanged,
> >>>>> but the
> >>>>>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics,
> >>> rather
> >>>>> than
> >>>>>>>>>>>>>>>>>>> introducing
> >>>>>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster
> >>>>> operators
> >>>>>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by
> >>> clients
> >>>>> without
> >>>>>>>>>>>>>>>>>>>>> requiring
> >>>>>>>>>>>>>>>>>>>>>>> changes to the client application code or
> >>> configuration.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>> Andrew
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to