Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-13 Thread Andrew Schofield
Hi Jun,
Thanks for the clarifications.

131. The client instance ids returned from 
KafkaStreams.clientInstanceIds(Duration) correspond to the
client_instance_id labels added by the broker to the metrics pushed from the 
clients. This should be
sufficient information to enable correlation between the metrics available in 
the client, and the metrics
pushed to the broker.
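For illustration, the correlation described here amounts to a reverse lookup from the broker-side `client_instance_id` label into the map returned by `KafkaStreams#clientInstanceIds()`. A minimal sketch (class name and ids are invented; the real map comes from the API discussed in this thread):

```java
import java.util.Map;

public class InstanceIdCorrelation {
    // Reverse lookup: given the threadKey -> clientInstanceId map from
    // KafkaStreams (simulated here) and a client_instance_id label seen
    // on broker-side metrics, find which Streams client pushed them.
    public static String ownerOf(Map<String, String> instanceIds, String label) {
        for (Map.Entry<String, String> e : instanceIds.entrySet()) {
            if (e.getValue().equals(label)) {
                return e.getKey();
            }
        }
        return null; // label does not belong to this Streams instance
    }

    public static void main(String[] args) {
        Map<String, String> ids = Map.of(
                "StreamThread-1", "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
                "StreamThread-2", "d2a8e3d0-0000-4000-8000-000000000002");
        System.out.println(ownerOf(ids, "b69cc35a-7a54-4790-aa69-cc2bd4ee4538"));
        // prints StreamThread-1
    }
}
```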

132. Yes, I see. I used JMX to look at the metrics on my broker and you’re 
entirely right. I will
remove the redundant metric from the KIP.

Thanks,
Andrew
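As a way to settle this kind of question empirically, one can enumerate the matching MBeans over JMX. A sketch (queried against the local platform MBeanServer here, so the result is empty unless a broker runs in-process; connect to a broker's JMX endpoint to see the real set of `request` tags):

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RequestsPerSecProbe {
    // List the RequestsPerSec MBeans visible on a server; the trailing ",*"
    // in the pattern matches any extra key properties (e.g. version tags).
    public static Set<ObjectName> requestsPerSecMBeans(MBeanServer server) throws Exception {
        ObjectName pattern = new ObjectName(
                "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*,*");
        return server.queryNames(pattern, null);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : requestsPerSecMBeans(server)) {
            System.out.println(name.getKeyProperty("request"));
        }
    }
}
```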

> On 12 Oct 2023, at 20:12, Jun Rao  wrote:
>
> Hi, Andrew,
>
> Thanks for the reply.
>
> 131. Could we also document how one could correlate each client instance in
> KStreams with the labels for the metrics received by the brokers?
>
> 132. The documentation for RequestsPerSec is not complete. If you trace
> through how
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L71
> 
> is
> implemented, it includes every API key tagged with the corresponding
> listener.
>
> Jun
>
> On Thu, Oct 12, 2023 at 11:42 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Jun,
>> Thanks for your comments.
>>
>> 130. As Matthias described, and I am adding to the KIP, the
>> `KafkaStreams#clientInstanceIds` method
>> is only permitted when the state is RUNNING or REBALANCING. Also, clients
>> can be added dynamically
>> so the maps might change over time. If it’s in a permitted state, the
>> method is prepared to wait up to the
>> supplied timeout to get the client instance ids. It does not return a
>> partial result - it returns a result or
>> fails.
>>
>> 131. I’ve refactored the `ClientsInstanceIds` object and the global
>> consumer is now part of the map
>> of consumers. There is no need for the Optional any longer. I’ve also
>> renamed it `ClientInstanceIds`.
>>
>> 132. My reading of
>> `(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*)` is that
>> it does not support every request type - it supports Produce,
>> FetchConsumer and FetchFollower.
>> Consequently, I think the ClientMetricsSubscriptionRequestCount is not
>> instantly obsolete.
>>
>> If I’ve misunderstood, please let me know.
>>
>> Thanks,
>> Andrew
>>
>>
>>> On 12 Oct 2023, at 01:07, Jun Rao  wrote:
>>>
>>> Hi, Andrew,
>>>
>>> Thanks for the updated KIP. Just a few more minor comments.
>>>
>>> 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for
>> all
>>> consumer/producer/adminClient instances to be initialized? Are all those
>>> instances created during KafkaStreams initialization?
>>>
>>> 131. Why does globalConsumerInstanceId() return Optional while
>>> other consumer instances don't return Optional?
>>>
>>> 132. ClientMetricsSubscriptionRequestCount: Do we need this since we
>> have a
>>> set of generic metrics
>>> (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
>>> report Request rate for every request type?
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax 
>> wrote:
>>>
 Thanks!

 On 10/10/23 11:31 PM, Andrew Schofield wrote:
> Matthias,
> Yes, I think that’s a sensible way forward and the interface you
>> propose
 looks good. I’ll update the KIP accordingly.
>
> Thanks,
> Andrew
>
>> On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:
>>
>> Andrew,
>>
>> yes I would like to get this change into KIP-714 right away. Seems to
>> be
 important, as we don't know if/when a follow-up KIP for Kafka Streams
>> would
 land.
>>
>> I was also thinking (and discussed with a few others) how to expose
>> it,
 and we would propose the following:
>>
>> We add a new method to `KafkaStreams` class:
>>
>>   public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>>
>> The returned object is like below:
>>
>> public class ClientsInstanceIds {
>>   // we only have a single admin client per KS instance
>>   String adminInstanceId();
>>
>>   // we only have a single global consumer per KS instance (if any)
>>   // Optional<> because we might not have global-thread
>>   Optional<String> globalConsumerInstanceId();
>>
>>   // return a <threadKey, clientInstanceId> mapping
>>   // for the underlying (restore-)consumers/producers
>>   Map<String, String> mainConsumerInstanceIds();
>>   Map<String, String> restoreConsumerInstanceIds();
>>   Map<String, String> producerInstanceIds();
>> }
>>
>> For the `threadKey`, we would use some pattern like this:
>>
>> [Stream|StateUpdater]Thread-<threadIdx>
>>
>>
>> Would this work from your POV?
>>
>>
>>
>> -Matthias
>>
>>
>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>>> Hi Matthias,
>>> Good point. Makes sense to me.
>>> Is this something that can also be included in the proposed Kafka
 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Matthias J. Sax
Thanks Andrew. Makes sense to me. Adding the parameter-less overload was 
just a random idea. No need to extend the KIP.



-Matthias

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Jun Rao
Hi, Andrew,

Thanks for the reply.

131. Could we also document how one could correlate each client instance in
KStreams with the labels for the metrics received by the brokers?

132. The documentation for RequestsPerSec is not complete. If you trace
through how
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L71

is
implemented, it includes every API key tagged with the corresponding
listener.

Jun


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Andrew Schofield
Hi Matthias,
I’ll answer (1) to (3).

(1) The KIP uses the phrase “client instance id” and the method name mirrors 
that. Personally, I’m
comfortable with the current name.

(2) That’s a good point. I’ll update it to use a Kafka Uuid instead.

(3) Although it’s a trivial thing to add an overload with no timeout parameter, 
the fact that it doesn’t really fit in the
Producer interface makes me prefer not to. I’d rather keep the timeout explicit 
on the method and keep the signature the
same across all three client interfaces that implement it.

I’ll update the KIP now.

Thanks,
Andrew

> On 12 Oct 2023, at 02:47, Matthias J. Sax  wrote:
>
> I can answer 130 and 131.
>
> 130) We cannot guarantee that all clients are already initialized due to race 
> conditions. We plan to not allow calling `KafkaStreams#clientsInstanceIds()` 
> when the state is not RUNNING (or REBALANCING) though -- guess this slipped
> through in the KIP and should be added? But because StreamThreads can be added
> dynamically (and producer might be created dynamically at runtime; cf below), 
> we still cannot guarantee that all clients are already initialized when the 
> method is called. Of course, we assume that all clients are most likely 
> initialized on the happy path, and blocking calls to
> `client.clientInstanceId()` should be rare.
>
> To address the worst case, we won't do a naive implementation and just loop 
> over all clients, but fan-out the call to the different StreamThreads (and 
> GlobalStreamThread if it exists), and use Futures to gather the results.
>
> Currently, `StreamThreads` has 3 clients (if ALOS or EOSv2 is used), so we 
> might do 3 blocking calls in the worst case (for EOSv1 we get a producer per 
> task, and we might end up doing more blocking calls if the producers are not
> initialized yet). Note that EOSv1 is already deprecated, and we are also 
> working on thread refactoring that will reduce the number of clients on
> StreamThread to 2 -- and we have more refactoring planned to reduce the 
> number of clients even further.
>
> Inside `KafkaStreams#clientsInstanceIds()` we might only do a single blocking
> call for the admin client (ie, `admin.clientInstanceId()`).
>
> I agree that we need to do some clever timeout management, but it seems to be 
> more of an implementation detail?
>
> Do you have any particular concerns, or does the proposed implementation as 
> sketched above address your question?
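The fan-out described above can be sketched with plain futures under a single deadline (simulated clients and a hypothetical class name, not the actual Streams internals; the all-or-TimeoutException semantics follow this thread's discussion):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientInstanceIdFanOut {
    // Fan the (potentially blocking) clientInstanceId() calls out to one
    // task per client and gather the results under one overall deadline.
    // Throws TimeoutException if any client misses the deadline.
    public static Map<String, String> gather(Map<String, Callable<String>> clients,
                                             Duration timeout)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, clients.size()));
        try {
            Map<String, Future<String>> futures = new HashMap<>();
            for (Map.Entry<String, Callable<String>> e : clients.entrySet()) {
                futures.put(e.getKey(), pool.submit(e.getValue()));
            }
            long deadline = System.nanoTime() + timeout.toNanos();
            Map<String, String> ids = new HashMap<>();
            for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
                long remaining = deadline - System.nanoTime(); // shrinks per client
                ids.put(e.getKey(),
                        e.getValue().get(Math.max(0, remaining), TimeUnit.NANOSECONDS));
            }
            return ids; // complete result or TimeoutException - never partial
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Callable<String>> clients = Map.of(
                "StreamThread-1-consumer", () -> "instance-id-1",
                "admin", () -> "instance-id-2");
        System.out.println(gather(clients, Duration.ofSeconds(5)));
    }
}
```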
>
>
> 131) If the Topology does not have a global-state-store, there won't be a
> GlobalThread and thus no global consumer. Thus, we return an Optional.
>
>
>
> On three related question for Andrew.
>
> (1) Why is the method called `clientInstanceId()` and not just plain 
> `instanceId()`?
>
> (2) Why do we return a `String` and not a UUID type? The added protocol
> request/response classes use UUIDs.
>
> (3) Would it make sense to have an overloaded `clientInstanceId()` method 
> that does not take any parameter but uses `default.api.timeout` config (this 
> config does not exist on the producer though, so we could only have it for
> consumer and admin at this point). We could of course also add overloads like
> this later if users request them (and/or add `default.api.timeout.ms` to the
> producer, too).
>
> Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as a 
> method name though, as `KafkaStreams` itself does not have an `instanceId` -- 
> we can also not have a timeout-less overload, because `KafkaStreams` does not 
> have a `default.api.timeout.ms` config either (and I don't think it makes
> sense to add).
>
>
>
> -Matthias
>

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Andrew Schofield
Hi Jun,
Thanks for your comments.

130. As Matthias described, and I am adding to the KIP, the 
`KafkaStreams#clientInstanceIds` method
is only permitted when the state is RUNNING or REBALANCING. Also, clients can 
be added dynamically
so the maps might change over time. If it’s in a permitted state, the method is 
prepared to wait up to the
supplied timeout to get the client instance ids. It does not return a partial 
result - it returns a result or
fails.

131. I’ve refactored the `ClientsInstanceIds` object and the global consumer is 
now part of the map
of consumers. There is no need for the Optional any longer. I’ve also renamed 
it `ClientInstanceIds`.

132. My reading of 
`(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*)` is that
it does not support every request type - it supports Produce, FetchConsumer and
FetchFollower.
Consequently, I think the ClientMetricsSubscriptionRequestCount is not 
instantly obsolete.

If I’ve misunderstood, please let me know.

Thanks,
Andrew


 On 10/9/23 2:15 AM, Andrew Schofield wrote:
> Hi Matthias,
> Good point. Makes sense to me.
> Is this something that can also be included in the proposed Kafka
>> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
> I have a slight preference for the former to put all of the KS
>> enhancements into a separate KIP.
> Thanks,
> Andrew
>> On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:
>>
>> Thanks Andrew. SGTM.
>>
>> One point you did not address is the idea to add a method to
>> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
>> added to consumer/producer/admin clients.
>>
>> Without addressing this, Kafka Streams users won't have a way to get
>> the assigned `instanceId` of the internally created clients, and thus it
>> would be very difficult for them to know which metrics that the broker
>> receives belong to a Kafka Streams app. It seems they would only find the
>> `instanceIds` in the log4j output if they enable client logging?
>>
>> Of course, because there are multiple clients inside Kafka Streams,
>> the return type cannot be a single "String", but must be some complex
>> data structure -- we could either add a new class, or return a
>> Map using a client key that maps to the `instanceId`.
>>
>> For example we could use the following key:
>>
>>   [Global]StreamThread[-][-restore][consumer|producer]
>>
>> (Of course, only the valid combination.)
>>
>> Or maybe even better, we might want to return a `Future` because
>> collecting all the `instanceId`s might be a blocking call on each client? I
>> have already a few idea 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Matthias J. Sax

Seems both Andrew and Jun prefer to merge the consumers. I am ok with this.

I'll leave it to Andrew to update the KIP accordingly, including adding 
`throws TimeoutException`.



-Matthias

On 10/12/23 10:07 AM, Jun Rao wrote:

Hi, Matthias,

130. Yes, throwing an exception sounds reasonable. It would be useful to
document this.

131. I was thinking that we could just return all consumers (including the
global consumer) through Map<String, String> consumerInstanceIds() and use
keys to identify each consumer instance. The benefit is that the
implementation (whether to use a separate global consumer or not) could
change in the future, but the API can remain the same. Another slight
benefit is that there is no need for returning Optional. If the
global consumer is not used, it just won't be included in the map.

Thanks,

Jun


On Thu, Oct 12, 2023 at 9:30 AM Matthias J. Sax  wrote:


Thanks Sophie and Jun.

`clientInstanceIds()` is fine with me -- was not sure about the double
plural myself.

Sorry if my comment was confusing. I was trying to say that adding an
overload to `KafkaStreams` that does not take a timeout parameter does
not make sense, because there is no `default.api.timeout.ms` config for
Kafka Streams, so users always need to pass in a timeout. (Same for
producer.)

For the implementation, I think KS would always call
`client.clientInstanceId(timeout)` and never rely on
`default.api.timeout.ms` though, so we can stay in control -- if a
timeout is passed by the user, it would always overwrite
`default.api.timeout.ms` on the consumer/admin and thus we should follow
the same semantics in Kafka Streams, and overwrite it explicitly when
calling `client.clientInstanceId()`.

The proposed API also makes sense to me. I was just wondering if we want
to extend it for client users -- for KS we won't need/use the
timeout-less overloads.



130) My intent was to throw a TimeoutException if we cannot get all
instanceIds, because it's the standard contract for timeouts. It would
also be hard to tell for a user, if a full or partial result was
returned (or we add a method `boolean isPartialResult()` to make it
easier for users).

If there are concerns/objections, I am also ok to return a partial result
-- it would require a change to the newly added `ClientInstanceIds`
return type -- for `adminInstanceId` we only return a `String` right now
-- we might need to change this to `Optional<String>` so we are able to
return a partial result?


131) Of course we could, but I am not sure what we would gain? In the
end, implementation details would always leak because if we change the
number of consumers we use, we would return different keys in the `Map`.
Atm, the proposal implies that the same key might be used for the "main"
and "restore" consumer of the same thread -- but we can make keys unique
by adding a `-restore` suffix to the restore-consumer key if we merge
both maps. -- Curious to hear what others think. I am very open to do it
differently than currently proposed.


-Matthias
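If the partial-result route floated above were taken, the return type might look like this (a purely hypothetical sketch with invented names; the thread leans toward throwing `TimeoutException` instead):

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical partial-result variant of ClientInstanceIds: any id that
// may be missing after a timeout becomes Optional, and isPartialResult()
// tells callers whether the snapshot is complete.
public final class PartialClientInstanceIds {
    private final Optional<String> adminInstanceId;
    private final Map<String, String> consumerInstanceIds;
    private final boolean partial;

    public PartialClientInstanceIds(Optional<String> adminInstanceId,
                                    Map<String, String> consumerInstanceIds,
                                    boolean partial) {
        this.adminInstanceId = adminInstanceId;
        this.consumerInstanceIds = consumerInstanceIds;
        this.partial = partial;
    }

    public Optional<String> adminInstanceId() { return adminInstanceId; }
    public Map<String, String> consumerInstanceIds() { return consumerInstanceIds; }
    public boolean isPartialResult() { return partial; }
}
```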


On 10/12/23 8:39 AM, Jun Rao wrote:

Hi, Matthias,

Thanks for the reply.

130. What would be the semantic? If the timeout has expired and only some
of the client instances' ids have been retrieved, does the call return the
partial result or throw an exception?

131. Could we group all consumer instances in a single method since we

are

returning the key for each instance already? This probably also avoids
exposing implementation details that could change over time.

Thanks,

Jun

On Thu, Oct 12, 2023 at 12:00 AM Sophie Blee-Goldman <

sop...@responsive.dev>

wrote:


Regarding the naming, I personally think `clientInstanceId` makes sense for
the plain clients -- especially if we might later introduce the notion of an
`applicationInstanceId`.

I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API,
though, can we use `clientInstanceIds` instead? (The difference being the
placement of the plural 's'.) I would similarly rename the class to just
ClientInstanceIds.

> we can also not have a timeout-less overload, because `KafkaStreams` does
> not have a `default.api.timeout.ms` config either

With respect to the timeout for the Kafka Streams API, I'm a bit confused by
the double/triple negative of Matthias' comment here, but I was thinking
about this earlier and this was my take: with the current proposal, we would
allow users to pass in an absolute timeout as a parameter that would apply
to the method as a whole. Meanwhile within the method we would issue
separate calls to each of the clients using the default or user-configured
value of their `default.api.timeout.ms` as the timeout parameter.

So the API as proposed makes sense to me.



Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Jun Rao
Hi, Matthias,

130. Yes, throwing an exception sounds reasonable. It would be useful to
document this.

131. I was thinking that we could just return all consumers (including the
global consumer) through Map<String, String> consumerInstanceIds() and use
keys to identify each consumer instance. The benefit is that the
implementation (whether to use a separate global consumer or not) could
change in the future, but the API can remain the same. Another slight
benefit is that there is no need for returning Optional. If the
global consumer is not used, it just won't be included in the map.

Thanks,

Jun
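Jun's merged-map idea, combined with the `-restore` suffix Matthias suggests for disambiguation, can be sketched as follows (key names are illustrative, not from the KIP):

```java
import java.util.HashMap;
import java.util.Map;

public class MergedConsumerIds {
    // Merge main/restore/global consumer ids into one map with unique,
    // implementation-agnostic keys; no Optional needed for the global
    // consumer - if absent, it simply is not in the map.
    public static Map<String, String> merge(Map<String, String> main,
                                            Map<String, String> restore,
                                            String globalConsumerId) { // null if no global thread
        Map<String, String> all = new HashMap<>(main);
        for (Map.Entry<String, String> e : restore.entrySet()) {
            all.put(e.getKey() + "-restore", e.getValue()); // avoid key clash with main
        }
        if (globalConsumerId != null) {
            all.put("GlobalStreamThread", globalConsumerId);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(merge(
                Map.of("StreamThread-1", "main-id"),
                Map.of("StreamThread-1", "restore-id"),
                "global-id"));
    }
}
```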


> as
> >> the timeout
> >> parameter.
> >>
> >> So the API as proposed makes sense to me.
> >>
> >>
> >> On Wed, Oct 11, 2023 at 6:48 PM Matthias J. Sax 
> wrote:
> >>
> >>> In can answer 130 and 131.
> >>>
> >>> 130) We cannot guarantee that all clients are already initialized due
> to
> >>> race conditions. We plan to not 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Andrew Schofield
Hi Matthias,
131) I also think that separating the main and restore consumers is excessively 
specific about the current implementation.
So, I think I’d prefer:

public class ClientInstanceIds {
  String adminInstanceId();

  Optional<String> globalConsumerInstanceId();

  Map<String, String> consumerInstanceIds();

  Map<String, String> producerInstanceIds();
}

I’m not sure whether it makes sense to combine the global consumer instance id 
too.
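For what it’s worth, a quick sketch (plain Java, all names and ids made up) of what folding the global consumer into the consumer map could look like, using a reserved key:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class FoldGlobalConsumer {
    // Hypothetical: fold the Optional global consumer instance id into the
    // per-thread consumer map under a reserved "global-consumer" key.
    static Map<String, String> combined(Optional<String> globalConsumerInstanceId,
                                        Map<String, String> consumerInstanceIds) {
        Map<String, String> all = new LinkedHashMap<>(consumerInstanceIds);
        globalConsumerInstanceId.ifPresent(id -> all.put("global-consumer", id));
        return all;
    }

    public static void main(String[] args) {
        Map<String, String> all = combined(
                Optional.of("global-id"),                       // made-up id
                Map.of("StreamThread-1-consumer", "consumer-id"));
        System.out.println(all.keySet()); // [StreamThread-1-consumer, global-consumer]
    }
}
```

The downside is that a reserved key leaks into the namespace of thread keys, which is an argument for keeping the separate Optional accessor.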

I’ve got rid of the double plural :)

My 2 cents.

Thanks,
Andrew


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Matthias J. Sax

Thanks Sophie and Jun.

`clientInstanceIds()` is fine with me -- was not sure about the double 
plural myself.


Sorry if my comments were confusing. I was trying to say that adding an
overload to `KafkaStreams` that does not take a timeout parameter does
not make sense, because there is no `default.api.timeout.ms` config for
Kafka Streams, so users always need to pass in a timeout. (Same for the
producer.)


For the implementation, I think KS would always call 
`client.clientInstanceId(timeout)` and never rely on 
`default.api.timeout.ms` though, so we can stay in control -- if a 
timeout is passed by the user, it would always overwrite 
`default.api.timeout.ms` on the consumer/admin and thus we should follow 
the same semantics in Kafka Streams, and overwrite it explicitly when 
calling `client.clientInstanceId()`.


The proposed API also makes sense to me. I was just wondering if we want 
to extend it for client users -- for KS we won't need/use the 
timeout-less overloads.




> 130) My intent was to throw a TimeoutException if we cannot get all 
> instanceIds, because it's the standard contract for timeouts. It would 
> also be hard for a user to tell whether a full or partial result was 
> returned (unless we add a method `boolean isPartialResult()` to make it 
> easier for users).


If there are concerns/objections, I am also ok to return a partial result 
-- it would require a change to the newly added `ClientInstanceIds` 
return type -- for `adminInstanceId` we only return a `String` right now 
-- we might need to change this to `Optional<String>` so we are able to 
return a partial result?
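To make that trade-off concrete, here is a minimal sketch (hypothetical names, plain Java) of what such a partial-result shape could look like, with `Optional` everywhere and the `isPartialResult()` helper mentioned above:

```java
import java.util.Map;
import java.util.Optional;

public class PartialResultSketch {
    // Hypothetical partial-result variant of ClientInstanceIds: every id is
    // Optional, and an absent value marks a client that did not answer in time.
    record Ids(Optional<String> adminInstanceId,
               Map<String, Optional<String>> consumerInstanceIds) {

        // The `boolean isPartialResult()` idea: true if any id is missing.
        boolean isPartialResult() {
            return adminInstanceId.isEmpty()
                    || consumerInstanceIds.values().stream().anyMatch(Optional::isEmpty);
        }
    }

    public static void main(String[] args) {
        Ids complete = new Ids(Optional.of("admin-id"),
                Map.of("StreamThread-1-consumer", Optional.of("consumer-id")));
        Ids partial = new Ids(Optional.of("admin-id"),
                Map.of("StreamThread-1-consumer", Optional.empty()));
        System.out.println(complete.isPartialResult() + " " + partial.isPartialResult()); // false true
    }
}
```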



131) Of course we could, but I am not sure what we would gain? In the 
end, implementation details would always leak, because if we change the 
number of consumers we use, we would return different keys in the `Map`. 
Atm, the proposal implies that the same key might be used for the "main" 
and "restore" consumer of the same thread -- but we can make keys unique 
by adding a `-restore` suffix to the restore-consumer key if we merge 
both maps. -- Curious to hear what others think. I am very open to doing 
it differently than currently proposed.
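The `-restore` suffix idea can be sketched in a few lines (thread keys and ids below are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergedConsumerKeys {
    // Sketch of merging main and restore consumer instance ids into one map,
    // disambiguating the restore consumers with a "-restore" suffix.
    static Map<String, String> merge(Map<String, String> mainConsumers,
                                     Map<String, String> restoreConsumers) {
        Map<String, String> merged = new LinkedHashMap<>(mainConsumers);
        restoreConsumers.forEach((threadKey, id) -> merged.put(threadKey + "-restore", id));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> merged = merge(
                Map.of("StreamThread-1", "main-id"),
                Map.of("StreamThread-1", "restore-id"));
        System.out.println(merged); // {StreamThread-1=main-id, StreamThread-1-restore=restore-id}
    }
}
```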



-Matthias



Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Jun Rao
Hi, Matthias,

Thanks for the reply.

130. What would be the semantic? If the timeout has expired and only some
of the client instances' id have been retrieved, does the call return the
partial result or throw an exception?

131. Could we group all consumer instances in a single method since we are
returning the key for each instance already? This probably also avoids
exposing implementation details that could change over time.

Thanks,

Jun


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-12 Thread Sophie Blee-Goldman
Regarding the naming, I personally think `clientInstanceId` makes sense for
the plain clients
 -- especially if we might later introduce the notion of an
`applicationInstanceId`.

I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API,
though, can we use
`clientInstanceIds` instead? (The difference being the placement of the
plural 's')
I would similarly rename the class to just ClientInstanceIds

> we can also not have a timeout-less overload, because `KafkaStreams` does
> not have a `default.api.timeout.ms` config either

With respect to the timeout for the Kafka Streams API, I'm a bit confused
by the
double/triple negative of Matthias' comment here, but I was thinking about
this
earlier and this was my take: with the current proposal, we would allow
users to pass
in an absolute timeout as a parameter that would apply to the method as a
whole.
Meanwhile within the method we would issue separate calls to each of the
clients using
the default or user-configured value of their  `default.api.timeout.ms` as
the timeout
parameter.

So the API as proposed makes sense to me.
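One way to read this (a sketch of the idea only, not something the KIP specifies): the method tracks one overall deadline, and each inner call gets the client's own default capped by whatever budget remains:

```java
import java.time.Duration;
import java.time.Instant;

public class TimeoutBudget {
    private final Instant deadline;

    TimeoutBudget(Duration overall) {
        // The user-supplied timeout applies to the method as a whole.
        this.deadline = Instant.now().plus(overall);
    }

    // Timeout for the next blocking client call: the client's own
    // default.api.timeout.ms, capped by what is left of the overall budget.
    Duration next(Duration clientDefaultApiTimeout) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative()) {
            remaining = Duration.ZERO;
        }
        return clientDefaultApiTimeout.compareTo(remaining) < 0
                ? clientDefaultApiTimeout
                : remaining;
    }

    public static void main(String[] args) {
        TimeoutBudget budget = new TimeoutBudget(Duration.ofSeconds(10));
        // A 60s client default gets capped to the ~10s overall budget.
        System.out.println(budget.next(Duration.ofSeconds(60)).getSeconds() <= 10); // true
    }
}
```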



Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Matthias J. Sax

I can answer 130 and 131. 

130) We cannot guarantee that all clients are already initialized due to 
race conditions. We plan to not allow calling 
`KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or 
REBALANCING) though -- guess this slipped in the KIP and should be 
added? But because StreamThreads can be added dynamically (and producers 
might be created dynamically at runtime; cf below), we still cannot 
guarantee that all clients are already initialized when the method is 
called. Of course, we assume that all clients are most likely initialized 
on the happy path, and blocking calls to `client.clientInstanceId()` 
should be rare.


To address the worst case, we won't do a naive implementation and just 
loop over all clients, but fan-out the call to the different 
StreamThreads (and GlobalStreamThread if it exists), and use Futures to 
gather the results.


Currently, `StreamThreads` has 3 clients (if ALOS or EOSv2 is used), so 
we might do 3 blocking calls in the worst case (for EOSv1 we get a 
producer per task, and we might end up doing more blocking calls if the 
producers are not initialized yet). Note that EOSv1 is already 
deprecated, and we are also working on thread refactoring that will 
reduce the number of clients on StreamThread to 2 -- and we have more 
refactoring planned to reduce the number of clients even further.


Inside `KafkaStreams#clientsInstanceIds()` we might only do a single 
blocking call for the admin client (i.e., `admin.clientInstanceId()`).


I agree that we need to do some clever timeout management, but it seems 
to be more of an implementation detail?
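As a rough sketch of that fan-out (plain JDK types; `Supplier<String>` stands in for the blocking `client.clientInstanceId()` call, and all names are made up):

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class FanOutSketch {
    // Ask every client for its instance id concurrently, then gather the
    // results against one shared overall deadline; a single slow client makes
    // the whole call fail with TimeoutException ("all or nothing").
    static Map<String, String> gather(Map<String, Supplier<String>> clients, Duration timeout)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, clients.size()));
        try {
            Map<String, Future<String>> futures = new LinkedHashMap<>();
            clients.forEach((key, client) -> futures.put(key, pool.submit(client::get)));
            long deadlineNanos = System.nanoTime() + timeout.toNanos();
            Map<String, String> ids = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
                long remaining = Math.max(0, deadlineNanos - System.nanoTime());
                ids.put(e.getKey(), e.getValue().get(remaining, TimeUnit.NANOSECONDS));
            }
            return ids;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> ids = gather(
                Map.of("admin", () -> "admin-id", "StreamThread-1-consumer", () -> "consumer-id"),
                Duration.ofSeconds(5));
        System.out.println(ids.size()); // 2
    }
}
```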


Do you have any particular concerns, or does the proposed implementation 
as sketched above address your question?



131) If the Topology does not have a global-state-store, there won't be 
a GlobalThread and thus no global consumer. Thus, we return an Optional.




Now, three related questions for Andrew.

(1) Why is the method called `clientInstanceId()` and not just plain 
`instanceId()`?


(2) Why do we return a `String` and not a UUID type? The added 
protocol request/response classes use UUIDs.


(3) Would it make sense to have an overloaded `clientInstanceId()` 
method that does not take any parameter but uses the `default.api.timeout` 
config (this config does not exist on the producer though, so we could 
only have it for consumer and admin at this point)? We could of course 
also add overloads like this later if users request them (and/or add 
`default.api.timeout.ms` to the producer, too).


Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as 
a method name though, as `KafkaStreams` itself does not have an 
`instanceId` -- we can also not have a timeout-less overload, because 
`KafkaStreams` does not have a `default.api.timeout.ms` config either 
(and I don't think it makes sense to add).




-Matthias


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Jun Rao
Hi, Andrew,

Thanks for the updated KIP. Just a few more minor comments.

130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for all
consumer/producer/adminClient instances to be initialized? Are all those
instances created during KafkaStreams initialization?

131. Why does globalConsumerInstanceId() return Optional<String> while
other consumer instances don't return Optional?

132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a
set of generic metrics
(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
report Request rate for every request type?

Thanks,

Jun

On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax  wrote:

> Thanks!
>
> On 10/10/23 11:31 PM, Andrew Schofield wrote:
> > Matthias,
> > Yes, I think that’s a sensible way forward and the interface you propose
> looks good. I’ll update the KIP accordingly.
> >
> > Thanks,
> > Andrew
> >
> >> On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:
> >>
> >> Andrew,
> >>
> >> yes I would like to get this change into KIP-714 right away. Seems to be
> important, as we don't know if/when a follow-up KIP for Kafka Streams would
> land.
> >>
> >> I was also thinking (and discussed with a few others) how to expose it,
> and we would propose the following:
> >>
> >> We add a new method to `KafkaStreams` class:
> >>
> >> public ClientsInstanceIds clientsInstanceIds(Duration timeout);
> >>
> >> The returned object is like below:
> >>
> >>   public class ClientsInstanceIds {
> >> // we only have a single admin client per KS instance
> >> String adminInstanceId();
> >>
> >> // we only have a single global consumer per KS instance (if any)
> >> // Optional<> because we might not have global-thread
> >> Optional<String> globalConsumerInstanceId();
> >>
> >> // return a <threadKey, ClientInstanceId> mapping
> >> // for the underlying (restore-)consumers/producers
> >> Map<String, String> mainConsumerInstanceIds();
> >> Map<String, String> restoreConsumerInstanceIds();
> >> Map<String, String> producerInstanceIds();
> >> }
> >>
> >> For the `threadKey`, we would use some pattern like this:
> >>
> >>   [Stream|StateUpdater]Thread-<threadId>
> >>
> >>
> >> Would this work from your POV?
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 10/9/23 2:15 AM, Andrew Schofield wrote:
> >>> Hi Matthias,
> >>> Good point. Makes sense to me.
> >>> Is this something that can also be included in the proposed Kafka
> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
> >>> I have a slight preference for the former to put all of the KS
> enhancements into a separate KIP.
> >>> Thanks,
> >>> Andrew
>  On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:
> 
>  Thanks Andrew. SGTM.
> 
>  One point you did not address is the idea to add a method to
> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
> added to consumer/producer/admin clients.
> 
>  Without addressing this, Kafka Streams users won't have a way to get
> the assigned `instanceId` of the internally created clients, and thus it
> would be very difficult for them to know which metrics that the broker
> receives belong to a Kafka Streams app. It seems they would only find the
> `instanceIds` in the log4j output if they enable client logging?
> 
>  Of course, because there are multiple clients inside Kafka Streams,
> the return type cannot be a single "String", but must be some complex
> data structure -- we could either add a new class, or return a
> Map using a client key that maps to the `instanceId`.
> 
>  For example we could use the following key:
> 
> [Global]StreamThread[-][-restore][consumer|producer]
> 
>  (Of course, only the valid combination.)
> 
>  Or maybe even better, we might want to return a `Future` because
> collecting all the `instanceId`s might be a blocking call on each client? I
> already have a few ideas how it could be implemented but I don't think it
> must be discussed on the KIP, as it's an implementation detail.
> 
>  Thoughts?
> 
> 
>  -Matthias
> 
>  On 10/6/23 4:21 AM, Andrew Schofield wrote:
> > Hi Matthias,
> > Thanks for your comments. I agree that a follow-up KIP for Kafka
> Streams makes sense. This KIP currently has made a bit
> > of an effort to embrace KS, but it’s not enough by a long way.
> > I have removed `application.id`. This should be done properly in the
> > follow-up KIP. I don’t believe there’s a downside to removing it from
> > this KIP.
> > I have reworded the statement about temporality. In practice, the
> implementation of this KIP that’s going on while the voting
> > progresses happens to use delta temporality, but that’s an
> implementation detail. Supporting clients must support both
> > temporalities.
> > I thought about exposing the client instance ID as a metric, but
> non-numeric metrics are not usual practice and tools
> > do not universally support them. I don’t think the KIP is improved by adding one now.

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Matthias J. Sax

Thanks!

On 10/10/23 11:31 PM, Andrew Schofield wrote:

Matthias,
Yes, I think that’s a sensible way forward and the interface you propose looks 
good. I’ll update the KIP accordingly.

Thanks,
Andrew


On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:

Andrew,

yes I would like to get this change into KIP-714 right away. Seems to be 
important, as we don't know if/when a follow-up KIP for Kafka Streams would 
land.

I was also thinking (and discussed with a few others) how to expose it, and we 
would propose the following:

We add a new method to `KafkaStreams` class:

public ClientsInstanceIds clientsInstanceIds(Duration timeout);

The returned object is like below:

  public class ClientsInstanceIds {
    // we only have a single admin client per KS instance
    String adminInstanceId();

    // we only have a single global consumer per KS instance (if any)
    // Optional<> because we might not have global-thread
    Optional<String> globalConsumerInstanceId();

    // return a <threadKey, ClientInstanceId> mapping
    // for the underlying (restore-)consumers/producers
    Map<String, String> mainConsumerInstanceIds();
    Map<String, String> restoreConsumerInstanceIds();
    Map<String, String> producerInstanceIds();
  }

For the `threadKey`, we would use some pattern like this:

  [Stream|StateUpdater]Thread-<threadId>


Would this work from your POV?



-Matthias


On 10/9/23 2:15 AM, Andrew Schofield wrote:

Hi Matthias,
Good point. Makes sense to me.
Is this something that can also be included in the proposed Kafka Streams 
follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS enhancements 
into a separate KIP.
Thanks,
Andrew

On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:

Thanks Andrew. SGTM.

One point you did not address is the idea to add a method to `KafkaStreams` 
similar to the proposed `clientInstanceId()` that will be added to 
consumer/producer/admin clients.

Without addressing this, Kafka Streams users won't have a way to get the 
assigned `instanceId` of the internally created clients, and thus it would be 
very difficult for them to know which metrics that the broker receives belong 
to a Kafka Streams app. It seems they would only find the `instanceIds` in the 
log4j output if they enable client logging?

Of course, because there are multiple clients inside Kafka Streams, the return type cannot be a 
single "String", but must be some complex data structure -- we could either add 
a new class, or return a Map<String, String> using a client key that maps to the 
`instanceId`.

For example we could use the following key:

   [Global]StreamThread[-<threadId>][-restore][consumer|producer]

(Of course, only the valid combination.)

Or maybe even better, we might want to return a `Future` because collecting all 
the `instanceId`s might be a blocking call on each client? I already have a few 
ideas about how it could be implemented but I don't think it must be discussed in the 
KIP, as it's an implementation detail.

Thoughts?


-Matthias

On 10/6/23 4:21 AM, Andrew Schofield wrote:

Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka Streams makes 
sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.
I have removed `application.id`. This should be done 
properly in the follow-up KIP. I don’t believe there’s a downside to
removing it from this KIP.
I have reworded the statement about temporality. In practice, the 
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an implementation 
detail. Supporting clients must support both
temporalities.
I thought about exposing the client instance ID as a metric, but non-numeric 
metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is improved by adding 
one now.
I have also added constants for the various Config classes for 
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.
Thanks,
Andrew

On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:

Hi,

I did not pay attention to this KIP in the past; seems it was on-hold for a 
while.

Overall it sounds very useful, and I think we should extend this with a follow-up 
KIP for Kafka Streams. What is unclear to me at this point is the statement:


Kafka Streams applications have an application.id configured and this 
identifier should be included as the application_id metrics label.


The `application.id` is currently only used as the (main) consumer's `group.id` 
(and is part of an auto-generated `client.id` if the user does not set one).

This comment relates to:


The following labels should be added by the client as appropriate before 
metrics are pushed.


Given that Kafka Streams uses the consumer/producer/admin client as "black 
boxes", a client does at this point not know that it's part of a Kafka Streams 
application, and thus, it won't be able to attach any such label to the metrics it sends.

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Andrew Schofield
Matthias,
Yes, I think that’s a sensible way forward and the interface you propose looks 
good. I’ll update the KIP accordingly.

Thanks,
Andrew

> On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:
>
> Andrew,
>
> yes I would like to get this change into KIP-714 right away. Seems to be 
> important, as we don't know if/when a follow-up KIP for Kafka Streams would 
> land.
>
> I was also thinking (and discussed with a few others) how to expose it, and 
> we would propose the following:
>
> We add a new method to `KafkaStreams` class:
>
>public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>
> The returned object is like below:
>
>  public class ClientsInstanceIds {
>    // we only have a single admin client per KS instance
>    String adminInstanceId();
>
>    // we only have a single global consumer per KS instance (if any)
>    // Optional<> because we might not have global-thread
>    Optional<String> globalConsumerInstanceId();
>
>    // return a <threadKey, ClientInstanceId> mapping
>    // for the underlying (restore-)consumers/producers
>    Map<String, String> mainConsumerInstanceIds();
>    Map<String, String> restoreConsumerInstanceIds();
>    Map<String, String> producerInstanceIds();
> }
>
> For the `threadKey`, we would use some pattern like this:
>
>  [Stream|StateUpdater]Thread-<threadId>
>
>
> Would this work from your POV?
>
>
>
> -Matthias
>
>
> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>> Hi Matthias,
>> Good point. Makes sense to me.
>> Is this something that can also be included in the proposed Kafka Streams 
>> follow-on KIP, or would you prefer that I add it to KIP-714?
>> I have a slight preference for the former to put all of the KS enhancements 
>> into a separate KIP.
>> Thanks,
>> Andrew
>>> On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:
>>>
>>> Thanks Andrew. SGTM.
>>>
>>> One point you did not address is the idea to add a method to `KafkaStreams` 
>>> similar to the proposed `clientInstanceId()` that will be added to 
>>> consumer/producer/admin clients.
>>>
>>> Without addressing this, Kafka Streams users won't have a way to get the 
>>> assigned `instanceId` of the internally created clients, and thus it would 
>>> be very difficult for them to know which metrics that the broker receives 
>>> belong to a Kafka Streams app. It seems they would only find the 
>>> `instanceIds` in the log4j output if they enable client logging?
>>>
>>> Of course, because there are multiple clients inside Kafka Streams, the 
>>> return type cannot be a single "String", but must be some complex 
>>> data structure -- we could either add a new class, or return a 
>>> Map<String, String> using a client key that maps to the `instanceId`.
>>>
>>> For example we could use the following key:
>>>
>>>   [Global]StreamThread[-<threadId>][-restore][consumer|producer]
>>>
>>> (Of course, only the valid combination.)
>>>
>>> Or maybe even better, we might want to return a `Future` because collecting 
>>> all the `instanceId`s might be a blocking call on each client? I already 
>>> have a few ideas about how it could be implemented but I don't think it must be 
>>> discussed in the KIP, as it's an implementation detail.
>>>
>>> Thoughts?
>>>
>>>
>>> -Matthias
>>>
>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
 Hi Matthias,
 Thanks for your comments. I agree that a follow-up KIP for Kafka Streams 
 makes sense. This KIP currently has made a bit
 of an effort to embrace KS, but it’s not enough by a long way.
 I have removed `application.id`. This should be 
 done properly in the follow-up KIP. I don’t believe there’s a downside to
 removing it from this KIP.
 I have reworded the statement about temporality. In practice, the 
 implementation of this KIP that’s going on while the voting
 progresses happens to use delta temporality, but that’s an implementation 
 detail. Supporting clients must support both
 temporalities.
 I thought about exposing the client instance ID as a metric, but 
 non-numeric metrics are not usual practice and tools
 do not universally support them. I don’t think the KIP is improved by 
 adding one now.
 I have also added constants for the various Config classes for 
 ENABLE_METRICS_PUSH_CONFIG, including to
 StreamsConfig. It’s best to be explicit about this.
 Thanks,
 Andrew
> On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:
>
> Hi,
>
> I did not pay attention to this KIP in the past; seems it was on-hold for 
> a while.
>
> Overall it sounds very useful, and I think we should extend this with a 
> follow-up KIP for Kafka Streams. What is unclear to me at this point is 
> the statement:
>
>> Kafka Streams applications have an application.id configured and this 
>> identifier should be included as the application_id metrics label.
>
> The `application.id` is currently only used as the (main) consumer's 
> `group.id` (and is part of an auto-generated `client.id` if the user does 
> not set one).
>

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-10 Thread Matthias J. Sax

Andrew,

yes I would like to get this change into KIP-714 right away. Seems to be 
important, as we don't know if/when a follow-up KIP for Kafka Streams 
would land.


I was also thinking (and discussed with a few others) how to expose it, 
and we would propose the following:


We add a new method to `KafkaStreams` class:

public ClientsInstanceIds clientsInstanceIds(Duration timeout);

The returned object is like below:

  public class ClientsInstanceIds {
    // we only have a single admin client per KS instance
    String adminInstanceId();

    // we only have a single global consumer per KS instance (if any)
    // Optional<> because we might not have global-thread
    Optional<String> globalConsumerInstanceId();

    // return a <threadKey, ClientInstanceId> mapping
    // for the underlying (restore-)consumers/producers
    Map<String, String> mainConsumerInstanceIds();
    Map<String, String> restoreConsumerInstanceIds();
    Map<String, String> producerInstanceIds();
  }

For the `threadKey`, we would use some pattern like this:

  [Stream|StateUpdater]Thread-<threadId>


Would this work from your POV?
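
[Editor's note: for illustration, here is a self-contained sketch of how an application might consume the proposed shape. Every identifier value below is a made-up stand-in; in a real application the object would be returned by the proposed `KafkaStreams#clientsInstanceIds(Duration)` method rather than constructed by hand.]

```java
import java.util.Map;
import java.util.Optional;

// Stand-in for the proposed return type; the real object would be populated
// by KafkaStreams itself. All instance-id values below are illustrative.
public class ClientsInstanceIdsSketch {

    record ClientsInstanceIds(
            String adminInstanceId,
            Optional<String> globalConsumerInstanceId,
            Map<String, String> mainConsumerInstanceIds,
            Map<String, String> restoreConsumerInstanceIds,
            Map<String, String> producerInstanceIds) { }

    public static void main(String[] args) {
        ClientsInstanceIds ids = new ClientsInstanceIds(
                "admin-7f3c",
                Optional.empty(),                           // no global thread
                Map.of("StreamThread-1", "consumer-a1b2"),  // threadKey -> id
                Map.of("StreamThread-1", "restore-c3d4"),
                Map.of("StreamThread-1", "producer-e5f6"));

        // Correlate broker-side metrics (labelled with client_instance_id)
        // back to the StreamThread that owns each internal client.
        ids.mainConsumerInstanceIds().forEach((threadKey, instanceId) ->
                System.out.println(threadKey + " -> " + instanceId));
    }
}
```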



-Matthias


On 10/9/23 2:15 AM, Andrew Schofield wrote:

Hi Matthias,
Good point. Makes sense to me.

Is this something that can also be included in the proposed Kafka Streams 
follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS enhancements 
into a separate KIP.

Thanks,
Andrew


On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:

Thanks Andrew. SGTM.

One point you did not address is the idea to add a method to `KafkaStreams` 
similar to the proposed `clientInstanceId()` that will be added to 
consumer/producer/admin clients.

Without addressing this, Kafka Streams users won't have a way to get the 
assigned `instanceId` of the internally created clients, and thus it would be 
very difficult for them to know which metrics that the broker receives belong 
to a Kafka Streams app. It seems they would only find the `instanceIds` in the 
log4j output if they enable client logging?

Of course, because there are multiple clients inside Kafka Streams, the return type cannot be a 
single "String", but must be some complex data structure -- we could either add 
a new class, or return a Map<String, String> using a client key that maps to the 
`instanceId`.

For example we could use the following key:

   [Global]StreamThread[-<threadId>][-restore][consumer|producer]

(Of course, only the valid combination.)

Or maybe even better, we might want to return a `Future` because collecting all 
the `instanceId`s might be a blocking call on each client? I already have a few 
ideas about how it could be implemented but I don't think it must be discussed in the 
KIP, as it's an implementation detail.

Thoughts?


-Matthias

On 10/6/23 4:21 AM, Andrew Schofield wrote:

Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka Streams makes 
sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.
I have removed `application.id`. This should be done 
properly in the follow-up KIP. I don’t believe there’s a downside to
removing it from this KIP.
I have reworded the statement about temporality. In practice, the 
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an implementation 
detail. Supporting clients must support both
temporalities.
I thought about exposing the client instance ID as a metric, but non-numeric 
metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is improved by adding 
one now.
I have also added constants for the various Config classes for 
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.
Thanks,
Andrew

On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:

Hi,

I did not pay attention to this KIP in the past; seems it was on-hold for a 
while.

Overall it sounds very useful, and I think we should extend this with a follow-up 
KIP for Kafka Streams. What is unclear to me at this point is the statement:


Kafka Streams applications have an application.id configured and this 
identifier should be included as the application_id metrics label.


The `application.id` is currently only used as the (main) consumer's `group.id` 
(and is part of an auto-generated `client.id` if the user does not set one).

This comment relates to:


The following labels should be added by the client as appropriate before 
metrics are pushed.


Given that Kafka Streams uses the consumer/producer/admin client as "black 
boxes", a client does at this point not know that it's part of a Kafka Streams 
application, and thus, it won't be able to attach any such label to the metrics it sends. 
(Also producer and admin don't even know the value of `application.id` -- only the (main) 
consumer, indirectly via `group.id`, but also restore and global consumer don't know it, 
because they don't have `group.id` set).

While I am totally in favor of the proposal, I am wondering how we intend to implement it in a clean way?

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-09 Thread Andrew Schofield
Hi Matthias,
Good point. Makes sense to me.

Is this something that can also be included in the proposed Kafka Streams 
follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS enhancements 
into a separate KIP.

Thanks,
Andrew

> On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:
>
> Thanks Andrew. SGTM.
>
> One point you did not address is the idea to add a method to `KafkaStreams` 
> similar to the proposed `clientInstanceId()` that will be added to 
> consumer/producer/admin clients.
>
> Without addressing this, Kafka Streams users won't have a way to get the 
> assigned `instanceId` of the internally created clients, and thus it would be 
> very difficult for them to know which metrics that the broker receives belong 
> to a Kafka Streams app. It seems they would only find the `instanceIds` in 
> the log4j output if they enable client logging?
>
> Of course, because there are multiple clients inside Kafka Streams, the return 
> type cannot be a single "String", but must be some complex data 
> structure -- we could either add a new class, or return a Map<String, String> 
> using a client key that maps to the `instanceId`.
>
> For example we could use the following key:
>
>   [Global]StreamThread[-<threadId>][-restore][consumer|producer]
>
> (Of course, only the valid combination.)
>
> Or maybe even better, we might want to return a `Future` because collecting 
> all the `instanceId`s might be a blocking call on each client? I already have a 
> few ideas about how it could be implemented but I don't think it must be discussed 
> in the KIP, as it's an implementation detail.
>
> Thoughts?
>
>
> -Matthias
>
> On 10/6/23 4:21 AM, Andrew Schofield wrote:
>> Hi Matthias,
>> Thanks for your comments. I agree that a follow-up KIP for Kafka Streams 
>> makes sense. This KIP currently has made a bit
>> of an effort to embrace KS, but it’s not enough by a long way.
>> I have removed `application.id`. This should be 
>> done properly in the follow-up KIP. I don’t believe there’s a downside to
>> removing it from this KIP.
>> I have reworded the statement about temporality. In practice, the 
>> implementation of this KIP that’s going on while the voting
>> progresses happens to use delta temporality, but that’s an implementation 
>> detail. Supporting clients must support both
>> temporalities.
>> I thought about exposing the client instance ID as a metric, but non-numeric 
>> metrics are not usual practice and tools
>> do not universally support them. I don’t think the KIP is improved by adding 
>> one now.
>> I have also added constants for the various Config classes for 
>> ENABLE_METRICS_PUSH_CONFIG, including to
>> StreamsConfig. It’s best to be explicit about this.
>> Thanks,
>> Andrew
>>> On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:
>>>
>>> Hi,
>>>
>>> I did not pay attention to this KIP in the past; seems it was on-hold for a 
>>> while.
>>>
>>> Overall it sounds very useful, and I think we should extend this with a 
>>> follow-up KIP for Kafka Streams. What is unclear to me at this point is the 
>>> statement:
>>>
 Kafka Streams applications have an application.id configured and this 
 identifier should be included as the application_id metrics label.
>>>
>>> The `application.id` is currently only used as the (main) consumer's 
>>> `group.id` (and is part of an auto-generated `client.id` if the user does 
>>> not set one).
>>>
>>> This comment relates to:
>>>
 The following labels should be added by the client as appropriate before 
 metrics are pushed.
>>>
>>> Given that Kafka Streams uses the consumer/producer/admin client as "black 
>>> boxes", a client does at this point not know that it's part of a Kafka 
>>> Streams application, and thus, it won't be able to attach any such label to 
>>> the metrics it sends. (Also producer and admin don't even know the value of 
>>> `application.id` -- only the (main) consumer, indirectly via `group.id`, 
>>> but also restore and global consumer don't know it, because they don't have 
>>> `group.id` set).
>>>
>>> While I am totally in favor of the proposal, I am wondering how we intend 
>>> to implement it in a clean way? Or would it be ok to have some internal 
>>> client APIs that KS can use to "register" itself with the client?
>>>
>>>
>>>
 While clients must support both temporalities, the broker will initially 
 only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>>>
>>> Not sure if I can follow. How is the decision made about DELTA or CUMULATIVE 
>>> metrics? Should the broker-side plugin not decide what metrics it wants to 
>>> receive in which form? So what does "initially" mean -- the broker won't 
>>> ship with a default plugin implementation?
>>>
>>>
>>>
 The following method is added to the Producer, Consumer, and Admin client 
 interfaces:
>>>
>>> Should we add anything to Kafka Streams to expose the underlying clients' 
>>> assigned client-instance-ids programmatically?

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-06 Thread Matthias J. Sax

Thanks Andrew. SGTM.

One point you did not address is the idea to add a method to 
`KafkaStreams` similar to the proposed `clientInstanceId()` that will be 
added to consumer/producer/admin clients.


Without addressing this, Kafka Streams users won't have a way to get the 
assigned `instanceId` of the internally created clients, and thus it 
would be very difficult for them to know which metrics that the broker 
receives belong to a Kafka Streams app. It seems they would only find 
the `instanceIds` in the log4j output if they enable client logging?


Of course, because there are multiple clients inside Kafka Streams, the 
return type cannot be a single "String", but must be some complex 
data structure -- we could either add a new class, or return a 
Map<String, String> using a client key that maps to the `instanceId`.


For example we could use the following key:

   [Global]StreamThread[-<threadId>][-restore][consumer|producer]

(Of course, only the valid combination.)

Or maybe even better, we might want to return a `Future` because 
collecting all the `instanceId`s might be a blocking call on each client? 
I already have a few ideas about how it could be implemented but I don't think 
it must be discussed in the KIP, as it's an implementation detail.
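
[Editor's note: as a rough illustration of the `Future` idea, the ids could be gathered concurrently and exposed through a single future. This is purely a sketch: `fetchInstanceId` below is a made-up stand-in for the potentially blocking `clientInstanceId()` call on each consumer/producer/admin client.]

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Sketch: gather instance ids from several internal clients in parallel
// rather than blocking on each in turn.
public class InstanceIdCollector {

    static String fetchInstanceId(String clientKey) {
        // In reality this would be the client's clientInstanceId() call,
        // which may block while the id is retrieved from the broker.
        return clientKey + "-instance";
    }

    static CompletableFuture<Map<String, String>> collect(List<String> clientKeys) {
        // Kick off one async fetch per client key.
        List<CompletableFuture<Map.Entry<String, String>>> futures = clientKeys.stream()
                .map(key -> CompletableFuture.supplyAsync(
                        () -> Map.entry(key, fetchInstanceId(key))))
                .collect(Collectors.toList());
        // Complete once all fetches are done, then assemble the map.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(v -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, String> ids =
                collect(List.of("StreamThread-1-consumer", "StreamThread-1-producer")).join();
        System.out.println(ids);
    }
}
```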


Thoughts?


-Matthias

On 10/6/23 4:21 AM, Andrew Schofield wrote:

Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka Streams makes 
sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.

I have removed `application.id`. This should be done 
properly in the follow-up KIP. I don’t believe there’s a downside to
removing it from this KIP.

I have reworded the statement about temporality. In practice, the 
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an implementation 
detail. Supporting clients must support both
temporalities.

I thought about exposing the client instance ID as a metric, but non-numeric 
metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is improved by adding 
one now.

I have also added constants for the various Config classes for 
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.

Thanks,
Andrew


On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:

Hi,

I did not pay attention to this KIP in the past; seems it was on-hold for a 
while.

Overall it sounds very useful, and I think we should extend this with a follow-up 
KIP for Kafka Streams. What is unclear to me at this point is the statement:


Kafka Streams applications have an application.id configured and this 
identifier should be included as the application_id metrics label.


The `application.id` is currently only used as the (main) consumer's `group.id` 
(and is part of an auto-generated `client.id` if the user does not set one).

This comment relates to:


The following labels should be added by the client as appropriate before 
metrics are pushed.


Given that Kafka Streams uses the consumer/producer/admin client as "black 
boxes", a client does at this point not know that it's part of a Kafka Streams 
application, and thus, it won't be able to attach any such label to the metrics it sends. 
(Also producer and admin don't even know the value of `application.id` -- only the (main) 
consumer, indirectly via `group.id`, but also restore and global consumer don't know it, 
because they don't have `group.id` set).

While I am totally in favor of the proposal, I am wondering how we intend to implement it 
in a clean way? Or would it be ok to have some internal client APIs that KS can use to 
"register" itself with the client?




While clients must support both temporalities, the broker will initially only 
send GetTelemetrySubscriptionsResponse.DeltaTemporality=True


Not sure if I can follow. How is the decision made about DELTA or CUMULATIVE metrics? Should 
the broker-side plugin not decide what metrics it wants to receive in which form? So what 
does "initially" mean -- the broker won't ship with a default plugin 
implementation?




The following method is added to the Producer, Consumer, and Admin client 
interfaces:


Should we add anything to Kafka Streams to expose the underlying clients' 
assigned client-instance-ids programmatically? I am also wondering if clients 
should report their assigned client-instance-ids as metrics itself (for this 
case, Kafka Streams won't need to do anything, because we already expose all 
client metrics).

If we add anything programmatic, we need to make it simple, given that Kafka 
Streams has many clients per `StreamThread` and may have multiple threads.




enable.metrics.push

It might be worth adding this to `StreamsConfig`, too? If set via 
StreamsConfig, we would forward it to all clients automatically.
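
[Editor's note: a minimal sketch of the forwarding idea. The constant name mirrors the ENABLE_METRICS_PUSH_CONFIG constants discussed elsewhere in the thread; plain `Properties` objects stand in for the real StreamsConfig and client configs to keep the example self-contained.]

```java
import java.util.Properties;

// Sketch: set enable.metrics.push once at the Streams level and copy it
// into the configs of each internal client (main consumer, restore
// consumer, global consumer, producer, admin).
public class MetricsPushConfigSketch {
    // Per the KIP, the config key is "enable.metrics.push".
    static final String ENABLE_METRICS_PUSH_CONFIG = "enable.metrics.push";

    public static void main(String[] args) {
        Properties streamsProps = new Properties();
        streamsProps.put(ENABLE_METRICS_PUSH_CONFIG, "false"); // opt out of pushing

        // Forwarding: the same key/value is copied into a client's config.
        Properties consumerProps = new Properties();
        consumerProps.putAll(streamsProps);
        System.out.println(consumerProps.getProperty(ENABLE_METRICS_PUSH_CONFIG));
    }
}
```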




-Matthias


On 9/29/23 5:45 PM, David Jacot wrote:

Hi Andrew,
Thanks for driving this one. I haven't read all the KIP yet but I already have an initial question.

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-06 Thread Andrew Schofield
Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka Streams makes 
sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.

I have removed `application.id`. This should be done 
properly in the follow-up KIP. I don’t believe there’s a downside to
removing it from this KIP.

I have reworded the statement about temporality. In practice, the 
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an implementation 
detail. Supporting clients must support both
temporalities.

I thought about exposing the client instance ID as a metric, but non-numeric 
metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is improved by adding 
one now.

I have also added constants for the various Config classes for 
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.

Thanks,
Andrew

> On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:
>
> Hi,
>
> I did not pay attention to this KIP in the past; seems it was on-hold for a 
> while.
>
> Overall it sounds very useful, and I think we should extend this with a 
> follow-up KIP for Kafka Streams. What is unclear to me at this point is the 
> statement:
>
>> Kafka Streams applications have an application.id configured and this 
>> identifier should be included as the application_id metrics label.
>
> The `application.id` is currently only used as the (main) consumer's 
> `group.id` (and is part of an auto-generated `client.id` if the user does not 
> set one).
>
> This comment relates to:
>
>> The following labels should be added by the client as appropriate before 
>> metrics are pushed.
>
> Given that Kafka Streams uses the consumer/producer/admin client as "black 
> boxes", a client does at this point not know that it's part of a Kafka 
> Streams application, and thus, it won't be able to attach any such label to 
> the metrics it sends. (Also producer and admin don't even know the value of 
> `application.id` -- only the (main) consumer, indirectly via `group.id`, but 
> also restore and global consumer don't know it, because they don't have 
> `group.id` set).
>
> While I am totally in favor of the proposal, I am wondering how we intend to 
> implement it in a clean way? Or would it be ok to have some internal client 
> APIs that KS can use to "register" itself with the client?
>
>
>
>> While clients must support both temporalities, the broker will initially 
>> only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True
>
> Not sure if I can follow. How is the decision made about DELTA or CUMULATIVE 
> metrics? Should the broker-side plugin not decide what metrics it wants to 
> receive in which form? So what does "initially" mean -- the broker won't ship 
> with a default plugin implementation?
>
>
>
>> The following method is added to the Producer, Consumer, and Admin client 
>> interfaces:
>
> Should we add anything to Kafka Streams to expose the underlying clients' 
> assigned client-instance-ids programmatically? I am also wondering if clients 
> should report their assigned client-instance-ids as metrics itself (for this 
> case, Kafka Streams won't need to do anything, because we already expose all 
> client metrics).
>
> If we add anything programmatic, we need to make it simple, given that Kafka 
> Streams has many clients per `StreamThread` and may have multiple threads.
>
>
>
>> enable.metrics.push
> It might be worth adding this to `StreamsConfig`, too? If set via 
> StreamsConfig, we would forward it to all clients automatically.
>
>
>
>
> -Matthias
>
>
> On 9/29/23 5:45 PM, David Jacot wrote:
>> Hi Andrew,
>> Thanks for driving this one. I haven't read all the KIP yet but I already
>> have an initial question. In the Threading section, it is written
>> "KafkaConsumer: the "background" thread (based on the consumer threading
>> refactor which is underway)". If I understand this correctly, it means
>> that KIP-714 won't work if the "old consumer" is used. Am I correct?
>> Cheers,
>> David
>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
>> andrew_schofield_j...@outlook.com> wrote:
>>> Hi Philip,
>>> No, I do not think it should actively search for a broker that supports
>>> the new
>>> RPCs. In general, either all of the brokers or none of the brokers will
>>> support it.
>>> In the window, where the cluster is being upgraded or client telemetry is
>>> being
>>> enabled, there might be a mixed situation. I wouldn’t put too much effort
>>> into
>>> this mixed scenario. As the client finds brokers which support the new
>>> RPCs,
>>> it can begin to follow the KIP-714 mechanism.
>>>
>>> Thanks,
>>> Andrew
>>>
 On 22 Sep 2023, at 20:01, Philip Nee  wrote:

 Hi Andrew -

 Question on top of your answers: Do you think the client should actively
>  search for a broker that supports this RPC? As previously mentioned, the
>  broker uses the leastLoadedNode to find its first connection (am I correct?),
>  and what if that broker doesn't support the metric push?

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-03 Thread Andrew Schofield
Hi David,
Thanks for your interest in KIP-714.

Because this KIP is under development at the same time as KIP-848, it will
need to support both the existing KafkaConsumer code and the refactored code
being worked on under KIP-848. I’ve updated the Threading section accordingly.

Thanks,
Andrew

> On 30 Sep 2023, at 01:45, David Jacot  wrote:
>
> Hi Andrew,
>
> Thanks for driving this one. I haven't read all the KIP yet but I already
> have an initial question. In the Threading section, it is written
> "KafkaConsumer: the "background" thread (based on the consumer threading
> refactor which is underway)". If I understand this correctly, it means
> that KIP-714 won't work if the "old consumer" is used. Am I correct?
>
> Cheers,
> David
>
>
> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Philip,
>> No, I do not think it should actively search for a broker that supports
>> the new
>> RPCs. In general, either all of the brokers or none of the brokers will
>> support it.
>> In the window, where the cluster is being upgraded or client telemetry is
>> being
>> enabled, there might be a mixed situation. I wouldn’t put too much effort
>> into
>> this mixed scenario. As the client finds brokers which support the new
>> RPCs,
>> it can begin to follow the KIP-714 mechanism.
>>
>> Thanks,
>> Andrew
>>
>>> On 22 Sep 2023, at 20:01, Philip Nee  wrote:
>>>
>>> Hi Andrew -
>>>
>>> Question on top of your answers: Do you think the client should actively
>>> search for a broker that supports this RPC? As previously mentioned, the
>>> broker uses the leastLoadedNode to find its first connection (am
>>> I correct?), and what if that broker doesn't support the metric push?
>>>
>>> P
>>>
>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
>>> andrew_schofield_j...@outlook.com> wrote:
>>>
 Hi Kirk,
 Thanks for your question. You are correct that the presence or absence
>> of
 the new RPCs in the
 ApiVersionsResponse tells the client whether to request the telemetry
 subscriptions and push
 metrics.

 This is of course tricky in practice. It would be conceivable, as a
 cluster is upgraded to AK 3.7
 or as a client metrics receiver plugin is deployed across the cluster,
 that a client connects to some
 brokers that support the new RPCs and some that do not.

 Here’s my suggestion:
 * If a client is not connected to any brokers that support the new
 RPCs, it cannot push metrics.
 * If a client is only connected to brokers that support the new RPCs, it
 will use the new RPCs in
 accordance with the KIP.
 * If a client is connected to some brokers that support the new RPCs and
 some that do not, it will
 use the new RPCs with the supporting subset of brokers in accordance
>> with
 the KIP.

 Comments?

 Thanks,
 Andrew

> On 22 Sep 2023, at 16:01, Kirk True  wrote:
>
> Hi Andrew/Jun,
>
> I want to make sure I understand question/comment #119… In the case
 where a cluster without a metrics client receiver is later reconfigured
>> and
 restarted to include a metrics client receiver, do we want the client to
 thereafter begin pushing metrics to the cluster? From Andrew’s response
>> to
 question #119, it sounds like we’re using the presence/absence of the
 relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
 indicator. Do I have that correct?
>
> Thanks,
> Kirk
>
>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
 andrew_schofield_j...@outlook.com> wrote:
>>
>> Hi Jun,
>> Thanks for your comments. I’ve updated the KIP to clarify where
 necessary.
>>
>> 110. Yes, agree. The motivation section mentions this.
>>
>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
 replacement of
>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
 think it’s a bit
>> of a debatable point. OTLP makes a distinction between a namespace
>> and a
>> multi-word component. If it was “client.id” then “client” would be a
 namespace with
>> an attribute key “id”. But “client_id” is just a key. So, it was
 intentional, but debatable.
>>
>> 112. Thanks. The link target moved. Fixed.
>>
>> 113. Thanks. Fixed.
>>
>> 114.1. If a standard metric makes sense for a client, it should use
>> the
 exact same
>> name. If a standard metric doesn’t make sense for a client, then it
>> can
 omit that metric.
>>
>> For a required metric, the situation is stronger. All clients must
 implement these
>> metrics with these names in order to implement the KIP. But the
 required metrics
>> are essentially the number of connections and the request latency,
 which do not
>> reference the underlying implementation of the client (which
 producer.record.queue.time.max

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-02 Thread Matthias J. Sax

Hi,

I did not pay attention to this KIP in the past; seems it was on-hold 
for a while.


Overall it sounds very useful, and I think we should extend this with a 
follow-up KIP for Kafka Streams. What is unclear to me at this point is 
the statement:



Kafka Streams applications have an application.id configured and this 
identifier should be included as the application_id metrics label.


The `application.id` is currently only used as the (main) consumer's 
`group.id` (and is part of an auto-generated `client.id` if the user 
does not set one).


This comment relates to:


The following labels should be added by the client as appropriate before 
metrics are pushed.


Given that Kafka Streams uses the consumer/producer/admin client as 
"black boxes", a client does not, at this point, know that it's part of a 
Kafka Streams application, and thus, it won't be able to attach any such 
label to the metrics it sends. (Also, producer and admin don't even know 
the value of `application.id` -- only the (main) consumer does, indirectly 
via `group.id`; the restore and global consumers don't know it either, 
because they don't have `group.id` set.)


While I am totally in favor of the proposal, I am wondering how we 
intend to implement it in a clean way? Or would it be ok to have some 
internal client APIs that KS can use to "register" itself with the client?





While clients must support both temporalities, the broker will initially only 
send GetTelemetrySubscriptionsResponse.DeltaTemporality=True


Not sure if I can follow. Who makes the decision about DELTA or 
CUMULATIVE metrics? Should the broker-side plugin not decide which 
metrics it wants to receive in which form? So what does "initially" mean 
-- the broker won't ship with a default plugin implementation?
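For reference, the two temporalities relate as follows; this is an illustrative sketch only, not Kafka code, showing the same counter pushed three times as CUMULATIVE (running totals) versus DELTA (change since the previous push):

```java
// Illustrative sketch (not Kafka code): converting CUMULATIVE counter
// readings into the DELTA values a broker plugin would otherwise receive.
public class Temporality {
    static long[] toDeltas(long[] cumulative) {
        long[] deltas = new long[cumulative.length];
        long previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            deltas[i] = cumulative[i] - previous; // DELTA = change since last push
            previous = cumulative[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        long[] cumulative = {10, 25, 40};     // CUMULATIVE: running totals per push
        long[] deltas = toDeltas(cumulative); // DELTA: {10, 15, 15}
        System.out.println(java.util.Arrays.toString(deltas)); // [10, 15, 15]
    }
}
```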





The following method is added to the Producer, Consumer, and Admin client 
interfaces:


Should we add anything to Kafka Streams to expose the underlying 
clients' assigned client-instance-ids programmatically? I am also 
wondering if clients should report their assigned client-instance-ids as 
metrics themselves (in this case, Kafka Streams won't need to do anything, 
because we already expose all client metrics).


If we add anything programmatic, we need to make it simple, given that 
Kafka Streams has many clients per `StreamThread` and may have multiple 
threads.





enable.metrics.push
It might be worth adding this to `StreamsConfig`, too? If set via 
StreamsConfig, we would forward it to all clients automatically.
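Such forwarding could be sketched as follows (hypothetical helper, not actual Streams code; the config key `enable.metrics.push` comes from the KIP, everything else here is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Kafka Streams forwarding enable.metrics.push
// from the Streams configuration to the clients it creates internally.
public class ConfigForwarding {
    static Map<String, Object> clientProps(Map<String, Object> streamsProps) {
        Map<String, Object> props = new HashMap<>();
        // Forward the setting so every internal client inherits it.
        Object v = streamsProps.get("enable.metrics.push");
        if (v != null) {
            props.put("enable.metrics.push", v);
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, Object> streams = Map.of("application.id", "my-app",
                                             "enable.metrics.push", false);
        System.out.println(clientProps(streams)); // {enable.metrics.push=false}
    }
}
```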





-Matthias


On 9/29/23 5:45 PM, David Jacot wrote:

Hi Andrew,

Thanks for driving this one. I haven't read all the KIP yet but I already
have an initial question. In the Threading section, it is written
"KafkaConsumer: the "background" thread (based on the consumer threading
refactor which is underway)". If I understand this correctly, it means
that KIP-714 won't work if the "old consumer" is used. Am I correct?

Cheers,
David


On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:


Hi Philip,
No, I do not think it should actively search for a broker that supports
the new
RPCs. In general, either all of the brokers or none of the brokers will
support it.
In the window where the cluster is being upgraded or client telemetry is
being
enabled, there might be a mixed situation. I wouldn’t put too much effort
into
this mixed scenario. As the client finds brokers which support the new
RPCs,
it can begin to follow the KIP-714 mechanism.

Thanks,
Andrew


On 22 Sep 2023, at 20:01, Philip Nee  wrote:

Hi Andrew -

Question on top of your answers: Do you think the client should actively
search for a broker that supports this RPC? As previously mentioned, the
broker uses the leastLoadedNode to find its first connection (am
I correct?), and what if that broker doesn't support the metric push?

P

On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:


Hi Kirk,
Thanks for your question. You are correct that the presence or absence

of

the new RPCs in the
ApiVersionsResponse tells the client whether to request the telemetry
subscriptions and push
metrics.

This is of course tricky in practice. It would be conceivable, as a
cluster is upgraded to AK 3.7
or as a client metrics receiver plugin is deployed across the cluster,
that a client connects to some
brokers that support the new RPCs and some that do not.

Here’s my suggestion:
* If a client is not connected to any brokers that support the new
RPCs, it cannot push metrics.
* If a client is only connected to brokers that support the new RPCs, it
will use the new RPCs in
accordance with the KIP.
* If a client is connected to some brokers that support the new RPCs and
some that do not, it will
use the new RPCs with the supporting subset of brokers in accordance

with

the KIP.

Comments?

Thanks,
Andrew


On 22 Sep 2023, at 16:01, Kirk True  wrote:

Hi Andrew/Jun,

I want to make sure I understand question/comment 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-29 Thread David Jacot
Hi Andrew,

Thanks for driving this one. I haven't read all the KIP yet but I already
have an initial question. In the Threading section, it is written
"KafkaConsumer: the "background" thread (based on the consumer threading
refactor which is underway)". If I understand this correctly, it means
that KIP-714 won't work if the "old consumer" is used. Am I correct?

Cheers,
David


On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Philip,
> No, I do not think it should actively search for a broker that supports
> the new
> RPCs. In general, either all of the brokers or none of the brokers will
> support it.
> In the window where the cluster is being upgraded or client telemetry is
> being
> enabled, there might be a mixed situation. I wouldn’t put too much effort
> into
> this mixed scenario. As the client finds brokers which support the new
> RPCs,
> it can begin to follow the KIP-714 mechanism.
>
> Thanks,
> Andrew
>
> > On 22 Sep 2023, at 20:01, Philip Nee  wrote:
> >
> > Hi Andrew -
> >
> > Question on top of your answers: Do you think the client should actively
> > search for a broker that supports this RPC? As previously mentioned, the
> > broker uses the leastLoadedNode to find its first connection (am
> > I correct?), and what if that broker doesn't support the metric push?
> >
> > P
> >
> > On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Kirk,
> >> Thanks for your question. You are correct that the presence or absence
> of
> >> the new RPCs in the
> >> ApiVersionsResponse tells the client whether to request the telemetry
> >> subscriptions and push
> >> metrics.
> >>
> >> This is of course tricky in practice. It would be conceivable, as a
> >> cluster is upgraded to AK 3.7
> >> or as a client metrics receiver plugin is deployed across the cluster,
> >> that a client connects to some
> >> brokers that support the new RPCs and some that do not.
> >>
> >> Here’s my suggestion:
> >> * If a client is not connected to any brokers that support the new
> >> RPCs, it cannot push metrics.
> >> * If a client is only connected to brokers that support the new RPCs, it
> >> will use the new RPCs in
> >> accordance with the KIP.
> >> * If a client is connected to some brokers that support the new RPCs and
> >> some that do not, it will
> >> use the new RPCs with the supporting subset of brokers in accordance
> with
> >> the KIP.
> >>
> >> Comments?
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 22 Sep 2023, at 16:01, Kirk True  wrote:
> >>>
> >>> Hi Andrew/Jun,
> >>>
> >>> I want to make sure I understand question/comment #119… In the case
> >> where a cluster without a metrics client receiver is later reconfigured
> and
> >> restarted to include a metrics client receiver, do we want the client to
> >> thereafter begin pushing metrics to the cluster? From Andrew’s response
> to
> >> question #119, it sounds like we’re using the presence/absence of the
> >> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
> >> indicator. Do I have that correct?
> >>>
> >>> Thanks,
> >>> Kirk
> >>>
>  On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> >> andrew_schofield_j...@outlook.com> wrote:
> 
>  Hi Jun,
>  Thanks for your comments. I’ve updated the KIP to clarify where
> >> necessary.
> 
>  110. Yes, agree. The motivation section mentions this.
> 
>  111. The replacement of ‘-‘ with ‘.’ for metric names and the
> >> replacement of
>  ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
> >> think it’s a bit
>  of a debatable point. OTLP makes a distinction between a namespace
> and a
>  multi-word component. If it was “client.id” then “client” would be a
> >> namespace with
>  an attribute key “id”. But “client_id” is just a key. So, it was
> >> intentional, but debatable.
> 
>  112. Thanks. The link target moved. Fixed.
> 
>  113. Thanks. Fixed.
> 
>  114.1. If a standard metric makes sense for a client, it should use
> the
> >> exact same
>  name. If a standard metric doesn’t make sense for a client, then it
> can
> >> omit that metric.
> 
>  For a required metric, the situation is stronger. All clients must
> >> implement these
>  metrics with these names in order to implement the KIP. But the
> >> required metrics
>  are essentially the number of connections and the request latency,
> >> which do not
>  reference the underlying implementation of the client (which
> >> producer.record.queue.time.max
>  of course does).
> 
>  I suppose someone might build a producer-only client that didn’t have
> >> consumer metrics.
>  In this case, the consumer metrics would conceptually have the value 0
> >> and would not
>  need to be sent to the broker.
> 
>  114.2. If a client does not implement some metrics, they will not be
> >> available for
>  analysis and 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-22 Thread Andrew Schofield
Hi Philip,
No, I do not think it should actively search for a broker that supports the new
RPCs. In general, either all of the brokers or none of the brokers will support 
it.
In the window where the cluster is being upgraded or client telemetry is being
enabled, there might be a mixed situation. I wouldn’t put too much effort into
this mixed scenario. As the client finds brokers which support the new RPCs,
it can begin to follow the KIP-714 mechanism.

Thanks,
Andrew

> On 22 Sep 2023, at 20:01, Philip Nee  wrote:
>
> Hi Andrew -
>
> Question on top of your answers: Do you think the client should actively
> search for a broker that supports this RPC? As previously mentioned, the
> broker uses the leastLoadedNode to find its first connection (am
> I correct?), and what if that broker doesn't support the metric push?
>
> P
>
> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Kirk,
>> Thanks for your question. You are correct that the presence or absence of
>> the new RPCs in the
>> ApiVersionsResponse tells the client whether to request the telemetry
>> subscriptions and push
>> metrics.
>>
>> This is of course tricky in practice. It would be conceivable, as a
>> cluster is upgraded to AK 3.7
>> or as a client metrics receiver plugin is deployed across the cluster,
>> that a client connects to some
>> brokers that support the new RPCs and some that do not.
>>
>> Here’s my suggestion:
>> * If a client is not connected to any brokers that support the new
>> RPCs, it cannot push metrics.
>> * If a client is only connected to brokers that support the new RPCs, it
>> will use the new RPCs in
>> accordance with the KIP.
>> * If a client is connected to some brokers that support the new RPCs and
>> some that do not, it will
>> use the new RPCs with the supporting subset of brokers in accordance with
>> the KIP.
>>
>> Comments?
>>
>> Thanks,
>> Andrew
>>
>>> On 22 Sep 2023, at 16:01, Kirk True  wrote:
>>>
>>> Hi Andrew/Jun,
>>>
>>> I want to make sure I understand question/comment #119… In the case
>> where a cluster without a metrics client receiver is later reconfigured and
>> restarted to include a metrics client receiver, do we want the client to
>> thereafter begin pushing metrics to the cluster? From Andrew’s response to
>> question #119, it sounds like we’re using the presence/absence of the
>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
>> indicator. Do I have that correct?
>>>
>>> Thanks,
>>> Kirk
>>>
 On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
>> andrew_schofield_j...@outlook.com> wrote:

 Hi Jun,
 Thanks for your comments. I’ve updated the KIP to clarify where
>> necessary.

 110. Yes, agree. The motivation section mentions this.

 111. The replacement of ‘-‘ with ‘.’ for metric names and the
>> replacement of
 ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
>> think it’s a bit
 of a debatable point. OTLP makes a distinction between a namespace and a
 multi-word component. If it was “client.id” then “client” would be a
>> namespace with
 an attribute key “id”. But “client_id” is just a key. So, it was
>> intentional, but debatable.

 112. Thanks. The link target moved. Fixed.

 113. Thanks. Fixed.

 114.1. If a standard metric makes sense for a client, it should use the
>> exact same
 name. If a standard metric doesn’t make sense for a client, then it can
>> omit that metric.

 For a required metric, the situation is stronger. All clients must
>> implement these
 metrics with these names in order to implement the KIP. But the
>> required metrics
 are essentially the number of connections and the request latency,
>> which do not
 reference the underlying implementation of the client (which
>> producer.record.queue.time.max
 of course does).

 I suppose someone might build a producer-only client that didn’t have
>> consumer metrics.
 In this case, the consumer metrics would conceptually have the value 0
>> and would not
 need to be sent to the broker.

 114.2. If a client does not implement some metrics, they will not be
>> available for
 analysis and troubleshooting. It just makes the ability to combine
>> metrics from lots of
 different clients less complete.

 115. I think it was probably a mistake to be so specific about
>> threading in this KIP.
 When the consumer threading refactor is complete, of course, it would
>> do the appropriate
 equivalent. I’ve added a clarification and massively simplified this
>> section.

 116. I removed “client.terminating”.

 117. Yes. Horrid. Fixed.

 118. The Terminating flag just indicates that this is the final
>> PushTelemetryRequest
 from this client. Any subsequent request will be rejected. I think this
>> flag should remain.

 119. Good catch. This was actually contradicting 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-22 Thread Philip Nee
Hi Andrew -

Question on top of your answers: Do you think the client should actively
search for a broker that supports this RPC? As previously mentioned, the
broker uses the leastLoadedNode to find its first connection (am
I correct?), and what if that broker doesn't support the metric push?

P

On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Kirk,
> Thanks for your question. You are correct that the presence or absence of
> the new RPCs in the
> ApiVersionsResponse tells the client whether to request the telemetry
> subscriptions and push
> metrics.
>
> This is of course tricky in practice. It would be conceivable, as a
> cluster is upgraded to AK 3.7
> or as a client metrics receiver plugin is deployed across the cluster,
> that a client connects to some
> brokers that support the new RPCs and some that do not.
>
> Here’s my suggestion:
> * If a client is not connected to any brokers that support the new
> RPCs, it cannot push metrics.
> * If a client is only connected to brokers that support the new RPCs, it
> will use the new RPCs in
> accordance with the KIP.
> * If a client is connected to some brokers that support the new RPCs and
> some that do not, it will
> use the new RPCs with the supporting subset of brokers in accordance with
> the KIP.
>
> Comments?
>
> Thanks,
> Andrew
>
> > On 22 Sep 2023, at 16:01, Kirk True  wrote:
> >
> > Hi Andrew/Jun,
> >
> > I want to make sure I understand question/comment #119… In the case
> where a cluster without a metrics client receiver is later reconfigured and
> restarted to include a metrics client receiver, do we want the client to
> thereafter begin pushing metrics to the cluster? From Andrew’s response to
> question #119, it sounds like we’re using the presence/absence of the
> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push
> indicator. Do I have that correct?
> >
> > Thanks,
> > Kirk
> >
> >> On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
> >>
> >> Hi Jun,
> >> Thanks for your comments. I’ve updated the KIP to clarify where
> necessary.
> >>
> >> 110. Yes, agree. The motivation section mentions this.
> >>
> >> 111. The replacement of ‘-‘ with ‘.’ for metric names and the
> replacement of
> >> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I
> think it’s a bit
> >> of a debatable point. OTLP makes a distinction between a namespace and a
> >> multi-word component. If it was “client.id” then “client” would be a
> namespace with
> >> an attribute key “id”. But “client_id” is just a key. So, it was
> intentional, but debatable.
> >>
> >> 112. Thanks. The link target moved. Fixed.
> >>
> >> 113. Thanks. Fixed.
> >>
> >> 114.1. If a standard metric makes sense for a client, it should use the
> exact same
> >> name. If a standard metric doesn’t make sense for a client, then it can
> omit that metric.
> >>
> >> For a required metric, the situation is stronger. All clients must
> implement these
> >> metrics with these names in order to implement the KIP. But the
> required metrics
> >> are essentially the number of connections and the request latency,
> which do not
> >> reference the underlying implementation of the client (which
> producer.record.queue.time.max
> >> of course does).
> >>
> >> I suppose someone might build a producer-only client that didn’t have
> consumer metrics.
> >> In this case, the consumer metrics would conceptually have the value 0
> and would not
> >> need to be sent to the broker.
> >>
> >> 114.2. If a client does not implement some metrics, they will not be
> available for
> >> analysis and troubleshooting. It just makes the ability to combine
> metrics from lots of
> >> different clients less complete.
> >>
> >> 115. I think it was probably a mistake to be so specific about
> threading in this KIP.
> >> When the consumer threading refactor is complete, of course, it would
> do the appropriate
> >> equivalent. I’ve added a clarification and massively simplified this
> section.
> >>
> >> 116. I removed “client.terminating”.
> >>
> >> 117. Yes. Horrid. Fixed.
> >>
> >> 118. The Terminating flag just indicates that this is the final
> PushTelemetryRequest
> >> from this client. Any subsequent request will be rejected. I think this
> flag should remain.
> >>
> >> 119. Good catch. This was actually contradicting another part of the
> KIP. The current behaviour
> >> is indeed preserved. If the broker doesn’t have a client metrics
> receiver plugin, the new RPCs
> >> in this KIP are “turned off” and not reported in ApiVersionsResponse.
> The client will not
> >> attempt to push metrics.
> >>
> >> 120. The error handling table lists the error codes for
> PushTelemetryResponse. I’ve added one
> >> but it looked good to me. GetTelemetrySubscriptions doesn’t have any
> error codes, since the
> >> situation in which the client telemetry is not supported is handled by
> the RPCs not being offered
> >> by the 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-22 Thread Andrew Schofield
Hi Kirk,
Thanks for your question. You are correct that the presence or absence of the 
new RPCs in the
ApiVersionsResponse tells the client whether to request the telemetry 
subscriptions and push
metrics.

This is of course tricky in practice. It would be conceivable, as a cluster is 
upgraded to AK 3.7
or as a client metrics receiver plugin is deployed across the cluster, that a 
client connects to some
brokers that support the new RPCs and some that do not.

Here’s my suggestion:
* If a client is not connected to any brokers that support the new RPCs, it 
cannot push metrics.
* If a client is only connected to brokers that support the new RPCs, it will 
use the new RPCs in
accordance with the KIP.
* If a client is connected to some brokers that support the new RPCs and some 
that do not, it will
use the new RPCs with the supporting subset of brokers in accordance with the 
KIP.
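
Expressed as code, the per-broker capability check above might look like this minimal sketch (the helper names are illustrative, not actual Kafka client internals; only the RPC names come from the KIP):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: decide per broker whether the KIP-714 RPCs can be
// used, based on what each broker advertised in its ApiVersionsResponse.
public class TelemetryCapability {
    static boolean supportsTelemetry(Set<String> advertisedApis) {
        return advertisedApis.contains("GetTelemetrySubscriptions")
            && advertisedApis.contains("PushTelemetry");
    }

    // The client can push metrics if any connected broker supports the RPCs.
    static boolean canPushAnywhere(Map<Integer, Set<String>> apisByBroker) {
        return apisByBroker.values().stream()
                .anyMatch(TelemetryCapability::supportsTelemetry);
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> cluster = Map.of(
            0, Set.of("Fetch", "Produce"),                            // old broker
            1, Set.of("Fetch", "Produce",
                      "GetTelemetrySubscriptions", "PushTelemetry")); // upgraded
        // Mixed cluster: the client pushes metrics via broker 1 only.
        System.out.println(canPushAnywhere(cluster)); // true
    }
}
```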

Comments?

Thanks,
Andrew

> On 22 Sep 2023, at 16:01, Kirk True  wrote:
>
> Hi Andrew/Jun,
>
> I want to make sure I understand question/comment #119… In the case where a 
> cluster without a metrics client receiver is later reconfigured and restarted 
> to include a metrics client receiver, do we want the client to thereafter 
> begin pushing metrics to the cluster? From Andrew’s response to question 
> #119, it sounds like we’re using the presence/absence of the relevant RPCs in 
> ApiVersionsResponse as the to-push-or-not-to-push indicator. Do I have that 
> correct?
>
> Thanks,
> Kirk
>
>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield 
>>  wrote:
>>
>> Hi Jun,
>> Thanks for your comments. I’ve updated the KIP to clarify where necessary.
>>
>> 110. Yes, agree. The motivation section mentions this.
>>
>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the replacement of
>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I think 
>> it’s a bit
>> of a debatable point. OTLP makes a distinction between a namespace and a
>> multi-word component. If it was “client.id” then “client” would be a 
>> namespace with
>> an attribute key “id”. But “client_id” is just a key. So, it was 
>> intentional, but debatable.
>>
>> 112. Thanks. The link target moved. Fixed.
>>
>> 113. Thanks. Fixed.
>>
>> 114.1. If a standard metric makes sense for a client, it should use the 
>> exact same
>> name. If a standard metric doesn’t make sense for a client, then it can omit 
>> that metric.
>>
>> For a required metric, the situation is stronger. All clients must implement 
>> these
>> metrics with these names in order to implement the KIP. But the required 
>> metrics
>> are essentially the number of connections and the request latency, which do 
>> not
>> reference the underlying implementation of the client (which 
>> producer.record.queue.time.max
>> of course does).
>>
>> I suppose someone might build a producer-only client that didn’t have 
>> consumer metrics.
>> In this case, the consumer metrics would conceptually have the value 0 and 
>> would not
>> need to be sent to the broker.
>>
>> 114.2. If a client does not implement some metrics, they will not be 
>> available for
>> analysis and troubleshooting. It just makes the ability to combine metrics 
>> from lots of
>> different clients less complete.
>>
>> 115. I think it was probably a mistake to be so specific about threading in 
>> this KIP.
>> When the consumer threading refactor is complete, of course, it would do the 
>> appropriate
>> equivalent. I’ve added a clarification and massively simplified this section.
>>
>> 116. I removed “client.terminating”.
>>
>> 117. Yes. Horrid. Fixed.
>>
>> 118. The Terminating flag just indicates that this is the final 
>> PushTelemetryRequest
>> from this client. Any subsequent request will be rejected. I think this flag 
>> should remain.
>>
>> 119. Good catch. This was actually contradicting another part of the KIP. 
>> The current behaviour
>> is indeed preserved. If the broker doesn’t have a client metrics receiver 
>> plugin, the new RPCs
>> in this KIP are “turned off” and not reported in ApiVersionsResponse. The 
>> client will not
>> attempt to push metrics.
>>
>> 120. The error handling table lists the error codes for 
>> PushTelemetryResponse. I’ve added one
>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any error 
>> codes, since the
>> situation in which the client telemetry is not supported is handled by the 
>> RPCs not being offered
>> by the broker.
>>
>> 121. Again, I think it’s probably a mistake to be specific about threading. 
>> Removed.
>>
>> 122. Good catch. For DescribeConfigs, the ACL operation should be
>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
>> resource.
>>
>> Thanks for the detailed review.
>>
>> Thanks,
>> Andrew
>>
>>>
>>> 110. Another potential motivation is the multiple clients support. Some of
>>> the places may not have good monitoring support for non-java clients.
>>>

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-22 Thread Kirk True
Hi Andrew/Jun,

I want to make sure I understand question/comment #119… In the case where a 
cluster without a metrics client receiver is later reconfigured and restarted 
to include a metrics client receiver, do we want the client to thereafter begin 
pushing metrics to the cluster? From Andrew’s response to question #119, it 
sounds like we’re using the presence/absence of the relevant RPCs in 
ApiVersionsResponse as the to-push-or-not-to-push indicator. Do I have that 
correct?

Thanks,
Kirk

> On Sep 21, 2023, at 7:42 AM, Andrew Schofield 
>  wrote:
> 
> Hi Jun,
> Thanks for your comments. I’ve updated the KIP to clarify where necessary.
> 
> 110. Yes, agree. The motivation section mentions this.
> 
> 111. The replacement of ‘-‘ with ‘.’ for metric names and the replacement of
> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I think 
> it’s a bit
> of a debatable point. OTLP makes a distinction between a namespace and a
> multi-word component. If it was “client.id” then “client” would be a 
> namespace with
> an attribute key “id”. But “client_id” is just a key. So, it was intentional, 
> but debatable.
> 
> 112. Thanks. The link target moved. Fixed.
> 
> 113. Thanks. Fixed.
> 
> 114.1. If a standard metric makes sense for a client, it should use the exact 
> same
> name. If a standard metric doesn’t make sense for a client, then it can omit 
> that metric.
> 
> For a required metric, the situation is stronger. All clients must implement 
> these
> metrics with these names in order to implement the KIP. But the required 
> metrics
> are essentially the number of connections and the request latency, which do 
> not
> reference the underlying implementation of the client (which 
> producer.record.queue.time.max
> of course does).
> 
> I suppose someone might build a producer-only client that didn’t have 
> consumer metrics.
> In this case, the consumer metrics would conceptually have the value 0 and 
> would not
> need to be sent to the broker.
> 
> 114.2. If a client does not implement some metrics, they will not be 
> available for
> analysis and troubleshooting. It just makes the ability to combine metrics 
> from lots of
> different clients less complete.
> 
> 115. I think it was probably a mistake to be so specific about threading in 
> this KIP.
> When the consumer threading refactor is complete, of course, it would do the 
> appropriate
> equivalent. I’ve added a clarification and massively simplified this section.
> 
> 116. I removed “client.terminating”.
> 
> 117. Yes. Horrid. Fixed.
> 
> 118. The Terminating flag just indicates that this is the final 
> PushTelemetryRequest
> from this client. Any subsequent request will be rejected. I think this flag 
> should remain.
> 
> 119. Good catch. This was actually contradicting another part of the KIP. The 
> current behaviour
> is indeed preserved. If the broker doesn’t have a client metrics receiver 
> plugin, the new RPCs
> in this KIP are “turned off” and not reported in ApiVersionsResponse. The 
> client will not
> attempt to push metrics.
> 
> 120. The error handling table lists the error codes for 
> PushTelemetryResponse. I’ve added one
> but it looked good to me. GetTelemetrySubscriptions doesn’t have any error 
> codes, since the
> situation in which the client telemetry is not supported is handled by the 
> RPCs not being offered
> by the broker.
> 
> 121. Again, I think it’s probably a mistake to be specific about threading. 
> Removed.
> 
> 122. Good catch. For DescribeConfigs, the ACL operation should be
> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
> resource.
> 
> Thanks for the detailed review.
> 
> Thanks,
> Andrew
> 
>> 
>> 110. Another potential motivation is the multiple clients support. Some of
>> the places may not have good monitoring support for non-java clients.
>> 
>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
>> replace '-' with '_' for attributes. Why is the inconsistency?
>> 
>> 112. OTLP specification: Page is not found from the link.
>> 
>> 113. "Defining standard and required metrics makes the monitoring and
>> troubleshooting of clients from various client types ": Incomplete sentence.
>> 
>> 114. standard/required metrics
>> 114.1 Do other clients need to implement those metrics with the exact same
>> names?
>> 114.2 What happens if some of those metrics are missing from a client?
>> 
>> 115. "KafkaConsumer: both the "heart beat" and application threads": We
>> have an ongoing effort to refactor the consumer threading model (
>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design).
>> Once this is done, PRC requests will only be made from the background
>> thread. Should this KIP follow the new model only?
>> 
>> 116. 'The metrics should contain the reason for the client termination by
>> including the client.terminating metric with the label 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-21 Thread Andrew Schofield
Hi Jun,
Thanks for your comments. I’ve updated the KIP to clarify where necessary.

110. Yes, agree. The motivation section mentions this.

111. The replacement of ‘-‘ with ‘.’ for metric names and the replacement of
‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I think it’s 
a bit
of a debatable point. OTLP makes a distinction between a namespace and a
multi-word component. If it was “client.id” then “client” would be a namespace 
with
an attribute key “id”. But “client_id” is just a key. So, it was intentional, 
but debatable.

112. Thanks. The link target moved. Fixed.

113. Thanks. Fixed.

114.1. If a standard metric makes sense for a client, it should use the exact 
same
name. If a standard metric doesn’t make sense for a client, then it can omit 
that metric.

For a required metric, the situation is stronger. All clients must implement 
these
metrics with these names in order to implement the KIP. But the required metrics
are essentially the number of connections and the request latency, which do not
reference the underlying implementation of the client (which 
producer.record.queue.time.max
of course does).

I suppose someone might build a producer-only client that didn’t have consumer 
metrics.
In this case, the consumer metrics would conceptually have the value 0 and 
would not
need to be sent to the broker.

114.2. If a client does not implement some metrics, they will not be available 
for
analysis and troubleshooting. It just makes the ability to combine metrics from
lots of different clients less complete.

115. I think it was probably a mistake to be so specific about threading in 
this KIP.
When the consumer threading refactor is complete, of course, it would do the 
appropriate
equivalent. I’ve added a clarification and massively simplified this section.

116. I removed “client.terminating”.

117. Yes. Horrid. Fixed.

118. The Terminating flag just indicates that this is the final 
PushTelemetryRequest
from this client. Any subsequent request will be rejected. I think this flag 
should remain.

119. Good catch. This was actually contradicting another part of the KIP. The 
current behaviour
is indeed preserved. If the broker doesn’t have a client metrics receiver 
plugin, the new RPCs
in this KIP are “turned off” and not reported in ApiVersionsResponse. The 
client will not
attempt to push metrics.
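A sketch of the client-side gating this implies, assuming the client inspects ApiVersionsResponse for the telemetry api keys (the numeric key values below are illustrative assumptions, not from the KIP text):

```python
GET_TELEMETRY_SUBSCRIPTIONS_API_KEY = 71  # illustrative assumption
PUSH_TELEMETRY_API_KEY = 72               # illustrative assumption

def telemetry_enabled(advertised_api_keys: set[int]) -> bool:
    # The broker only advertises the telemetry RPCs when a client metrics
    # receiver plugin is configured, so their absence means "do not push".
    return (GET_TELEMETRY_SUBSCRIPTIONS_API_KEY in advertised_api_keys
            and PUSH_TELEMETRY_API_KEY in advertised_api_keys)
```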

120. The error handling table lists the error codes for PushTelemetryResponse.
I’ve added one more, but otherwise it looked good to me. GetTelemetrySubscriptions doesn’t have any error
codes, since the
situation in which the client telemetry is not supported is handled by the RPCs 
not being offered
by the broker.

121. Again, I think it’s probably a mistake to be specific about threading. 
Removed.

122. Good catch. For DescribeConfigs, the ACL operation should be
“DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
“ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
resource.
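As a rough illustration, granting those permissions with the standard ACL tooling might look like the following (a sketch only; the principal name and bootstrap address are placeholders):

```shell
# Allow reading client metrics subscriptions: DescribeConfigs on CLUSTER.
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:metrics-admin \
  --operation DescribeConfigs --cluster

# Allow managing client metrics subscriptions: Alter on CLUSTER.
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:metrics-admin \
  --operation Alter --cluster
```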

Thanks for the detailed review.

Thanks,
Andrew

>
> 110. Another potential motivation is the multiple clients support. Some of
> the places may not have good monitoring support for non-java clients.
>
> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
> replace '-' with '_' for attributes. Why the inconsistency?
>
> 112. OTLP specification: Page is not found from the link.
>
> 113. "Defining standard and required metrics makes the monitoring and
> troubleshooting of clients from various client types ": Incomplete sentence.
>
> 114. standard/required metrics
> 114.1 Do other clients need to implement those metrics with the exact same
> names?
> 114.2 What happens if some of those metrics are missing from a client?
>
> 115. "KafkaConsumer: both the "heart beat" and application threads": We
> have an ongoing effort to refactor the consumer threading model (
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design).
> Once this is done, RPC requests will only be made from the background
> thread. Should this KIP follow the new model only?
>
> 116. 'The metrics should contain the reason for the client termination by
> including the client.terminating metric with the label “reason” ...'. Hmm,
> are we introducing a new metric client.terminating? If so, that needs to be
> explicitly listed.
>
> 117. "As the metrics plugin may need to add additional metrics on top of
> this the generic metrics receiver in the broker will not add these labels
> but rely on the plugins to do so," The sentence doesn't read well.
>
> 118. "it is possible for the client to send at most one accepted
> out-of-profile per connection before the rate-limiter kicks in": If we do
> this, do we still need the Terminating flag in PushTelemetryRequestV0?
>
> 119. "If there is no client metrics receiver plugin configured on the
> broker, it will respond to GetTelemetrySubscriptionsRequest with
> RequestedMetrics set to Null and a -1 SubscriptionId. The client should
> send 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-09-20 Thread Jun Rao
Hi, Andrew,

Thanks for updating the KIP and sorry for the late response. A few more
comments.

110. Another potential motivation is the multiple clients support. Some of
the places may not have good monitoring support for non-java clients.

111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
replace '-' with '_' for attributes. Why the inconsistency?

112. OTLP specification: Page is not found from the link.

113. "Defining standard and required metrics makes the monitoring and
troubleshooting of clients from various client types ": Incomplete sentence.

114. standard/required metrics
114.1 Do other clients need to implement those metrics with the exact same
names?
114.2 What happens if some of those metrics are missing from a client?

115. "KafkaConsumer: both the "heart beat" and application threads": We
have an ongoing effort to refactor the consumer threading model (
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design).
Once this is done, RPC requests will only be made from the background
thread. Should this KIP follow the new model only?

116. 'The metrics should contain the reason for the client termination by
including the client.terminating metric with the label “reason” ...'. Hmm,
are we introducing a new metric client.terminating? If so, that needs to be
explicitly listed.

117. "As the metrics plugin may need to add additional metrics on top of
this the generic metrics receiver in the broker will not add these labels
but rely on the plugins to do so," The sentence doesn't read well.

118. "it is possible for the client to send at most one accepted
out-of-profile per connection before the rate-limiter kicks in": If we do
this, do we still need the Terminating flag in PushTelemetryRequestV0?

119. "If there is no client metrics receiver plugin configured on the
broker, it will respond to GetTelemetrySubscriptionsRequest with
RequestedMetrics set to Null and a -1 SubscriptionId. The client should
send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs has
expired. This allows the metrics receiver to be enabled or disabled without
having to restart the broker or reset the client connection."
"no client metrics receiver plugin configured" is defined by no metric
reporter implementing the ClientTelemetry interface, right? In that case,
it would be useful to avoid the clients sending
GetTelemetrySubscriptionsRequest periodically to preserve the current
behavior.

120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: Could
we list error codes for each?

121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may be
called from the request handling thread": Where else can this method be
called?

122. DescribeConfigs/AlterConfigs already exist. Are we changing the ACL?

Thanks,

Jun

On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Milind,
> Thanks for your question.
>
> On reflection, I agree that INVALID_RECORD is most likely to be caused by a
> problem in the serialization in the client. I have changed the client
> action in this case
> to “Log an error and stop pushing metrics”.
>
> I have updated the KIP text accordingly.
>
> Thanks,
> Andrew
>
> > On 31 Jul 2023, at 12:09, Milind Luthra 
> wrote:
> >
> > Hi Andrew,
> > Thanks for the clarifications.
> >
> > About 2b:
> > In case a client has a bug while serializing, it might be difficult for
> the
> > client to recover from that without code changes. In that, it might be
> good
> > to just log the INVALID_RECORD as an error, and treat the error as fatal
> > for the client (only fatal in terms of sending the metrics, the client
> can
> > keep functioning otherwise). What do you think?
> >
> > Thanks
> > Milind
> >
> > On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Milind,
> >> Thanks for your questions about the KIP.
> >>
> >> 1) I did some archaeology and looked at historical versions of the KIP.
> I
> >> think this is
> >> just a mistake. 5 minutes is the default metric push interval. 30
> minutes
> >> is a mystery
> >> to me. I’ve updated the KIP.
> >>
> >> 2) I think there are two situations in which INVALID_RECORD might occur.
> >> a) The client might perhaps be using a content-type that the broker does
> >> not support.
> >> The KIP mentions content-type as a future extension, but there’s only
> one
> >> supported
> >> to start with. Until we have multiple content-types, this seems out of
> >> scope. I think a
> >> future KIP would add another error code for this.
> >> b) The client might perhaps have a bug which means the metrics payload
> is
> >> malformed.
> >> Logging a warning and attempting the next metrics push on the push
> >> interval seems
> >> appropriate.
> >>
> >> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
> >> GetTelemetrySubscriptionsRequest.
> >>
> >> UNSUPPORTED_COMPRESSION_TYPE seems like either 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-08-25 Thread Andrew Schofield
Hi Tom,
Thanks for your comments. Sorry for the delay in responding. They’re still 
useful
comments in spite of the fact that the voting has begun.

1) This is a good idea. I expect the client will emit the client instance ID
as a log line.

2) I will add PROXY protocol support to the future work. I agree.

3) Thanks for the suggestion.

4) Yes, the client authenticates before it can send any of the RPCs in this KIP.

5) a) Yes, a rogue client could in principle send metrics to all brokers 
resulting
in a lot of data being exported to the back end. Of course, a proper deployment
of a client telemetry reporter plugin would be instrumented to help the operator
diagnose this situation.

b) There are already instances of the client sending compressed data to
the broker. I think it is prudent to limit the maximum metrics payload size.
I will update the KIP accordingly.

6) Yes, bundling and relocating.

7) I will add a broker metric to help with diagnosis.

8) I will add some clarifying text. If the broker does not have a configured
metrics reporter that supports the new interface, it should not push metrics
and will not receive a metrics subscription. I am thinking over the options for
achieving this cleanly and will update the KIP.

Thanks for your interest in the KIP.

Thanks,
Andrew

> On 11 Aug 2023, at 09:48, Tom Bentley  wrote:
>
> Hi Andrew,
>
> Thanks for picking this KIP up. I know you've started a vote, so these are
> unhelpfully late... sorry about that, but hopefully they're still useful.
>
> 1. "The Kafka Java client provides an API for the application to read the
> generated client instance id to assist in mapping and identification of the
> client based on the collected metrics." In the multi-client, single-process
> model perhaps it would be desirable to have the option of including this in
> log messages emitted by the client library.
>
> 2. "Mapping the client instance id to an actual application instance
> running on a (virtual) machine can be done by inspecting the metrics
> resource labels, such as the client source address and source port, or
> security principal, all of which are added by the receiving broker." The
> specific example of correlation given here (source IP address) is
> problematic in environments where there may be network proxies (e.g.
> Kubernetes ingress) on the path between client and broker: The broker sees
> the IP address of the proxy. This is a rough edge which could be smoothed
> out if Kafka supported the PROXY protocol[1] which seems to have become
> something of a de facto standard. I'm not suggesting this needs to be part of
> the KIP, but perhaps it could be added to Future Work?
> [1]: http://www.haproxy.org/download/2.9/doc/proxy-protocol.txt
>
> 3. Compression... just an idle idea, but I wonder if a useful further
> improvement in compression ratio could be achieved using zstd's support for
> dictionary compression[2]. I.e. a client could initially use training mode
> when sending metrics, but eventually include a dictionary to be used for
> subsequent metric sends. It's probably not worth the effort (at least for
> the initial implementation), but since you've gone to the effort of
> providing some numbers anyway, maybe it's not much additional effort to at
> least find out whether this makes a useful difference.
> [2]: http://facebook.github.io/zstd/#small-data
>
> 4. Maybe I didn't spot it, but I assume the initial
> GetTelemetrySubscriptionsRequest
> happens after authentication?
>
> 5. Rogue clients -- There are some interesting things to consider if we're
> trying to defend against a genuinely adversarial client.
>
> a) Client sends profiling information to all brokers at the maximum rate.
> Each broker forwards to the time series DB. Obviously this scales linearly
> with number of brokers, but it's clear that the load on the tsdb could be
> many times larger than users might expect.
> b) Client sends crafted compressed data which decompresses to require more
> memory than the broker can allocate.
>
> 6. Shadowing the OLTP and protobuf jars -- to be clear by this you mean
> both bundling _and_ relocating?
>
> 7. "In case the cluster load induced from metrics requests becomes
> unmanageable the remedy is to temporarily remove or limit configured
> metrics subscriptions.  " How would a user know that the observed load was
> due to handling metrics requests?
>
> 8. If I understand correctly, when the configured metrics reporter does not
> implement the new interface the client would still follow the described
> protocol only to have nowhere to send the metrics. Am I overlooking
> something?
>
> Thanks again,
>
> Tom
>
> On Fri, 11 Aug 2023 at 07:52, Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Doguscan,
>> Thanks for your question.
>>
>> If the target broker is unreachable, the client can send the metrics to
>> another
>> broker. It can select any of the other brokers for this purpose. What I
>> expect in
>> practice is that it 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-08-11 Thread Tom Bentley
Hi Andrew,

Thanks for picking this KIP up. I know you've started a vote, so these are
unhelpfully late... sorry about that, but hopefully they're still useful.

1. "The Kafka Java client provides an API for the application to read the
generated client instance id to assist in mapping and identification of the
client based on the collected metrics." In the multi-client, single-process
model perhaps it would be desirable to have the option of including this in
log messages emitted by the client library.

2. "Mapping the client instance id to an actual application instance
running on a (virtual) machine can be done by inspecting the metrics
resource labels, such as the client source address and source port, or
security principal, all of which are added by the receiving broker." The
specific example of correlation given here (source IP address) is
problematic in environments where there may be network proxies (e.g.
Kubernetes ingress) on the path between client and broker: The broker sees
the IP address of the proxy. This is a rough edge which could be smoothed
out if Kafka supported the PROXY protocol[1] which seems to have become
something of a de facto standard. I'm not suggesting this needs to be part of
the KIP, but perhaps it could be added to Future Work?
[1]: http://www.haproxy.org/download/2.9/doc/proxy-protocol.txt

3. Compression... just an idle idea, but I wonder if a useful further
improvement in compression ratio could be achieved using zstd's support for
dictionary compression[2]. I.e. a client could initially use training mode
when sending metrics, but eventually include a dictionary to be used for
subsequent metric sends. It's probably not worth the effort (at least for
the initial implementation), but since you've gone to the effort of
providing some numbers anyway, maybe it's not much additional effort to at
least find out whether this makes a useful difference.
[2]: http://facebook.github.io/zstd/#small-data
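To illustrate the idea of a shared dictionary helping small payloads, here is a self-contained sketch using zlib's analogous preset-dictionary support from the Python standard library (rather than zstd, purely to avoid third-party dependencies); the payload and dictionary contents are made up:

```python
import json
import zlib

# A small metrics payload: at this size the compressor has little history
# to exploit, which is exactly where a preset dictionary helps.
payload = json.dumps({
    "producer.record.queue.time.max": 12.5,
    "producer.record.queue.time.avg": 3.2,
    "client.request.latency.max": 40.1,
}).encode()

# Dictionary seeded with byte strings expected in most payloads.
shared_dict = b'"producer.record.queue.time." "client.request.latency." max avg'

plain = zlib.compress(payload, 9)

comp = zlib.compressobj(9, zdict=shared_dict)
with_dict = comp.compress(payload) + comp.flush()

# Both sides must use the same dictionary to decompress.
decomp = zlib.decompressobj(zdict=shared_dict)
assert decomp.decompress(with_dict) + decomp.flush() == payload
```

With payloads that share most of their field names with the dictionary, the dictionary-primed stream comes out noticeably smaller than the plain one.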

4. Maybe I didn't spot it, but I assume the initial
GetTelemetrySubscriptionsRequest
happens after authentication?

5. Rogue clients -- There are some interesting things to consider if we're
trying to defend against a genuinely adversarial client.

a) Client sends profiling information to all brokers at the maximum rate.
Each broker forwards to the time series DB. Obviously this scales linearly
with number of brokers, but it's clear that the load on the tsdb could be
many times larger than users might expect.
b) Client sends crafted compressed data which decompresses to require more
memory than the broker can allocate.

6. Shadowing the OLTP and protobuf jars -- to be clear by this you mean
both bundling _and_ relocating?

7. "In case the cluster load induced from metrics requests becomes
unmanageable the remedy is to temporarily remove or limit configured
metrics subscriptions.  " How would a user know that the observed load was
due to handling metrics requests?

8. If I understand correctly, when the configured metrics reporter does not
implement the new interface the client would still follow the described
protocol only to have nowhere to send the metrics. Am I overlooking
something?

Thanks again,

Tom

On Fri, 11 Aug 2023 at 07:52, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Doguscan,
> Thanks for your question.
>
> If the target broker is unreachable, the client can send the metrics to
> another
> broker. It can select any of the other brokers for this purpose. What I
> expect in
> practice is that it loses connection to the broker it’s been using for
> metrics,
> chooses or establishes a connection to another broker, and then selects
> that
> broker for subsequent metrics pushes.
>
> Thanks,
> Andrew
>
> > On 8 Aug 2023, at 08:34, Doğuşcan Namal 
> wrote:
> >
> > Thanks for your answers Andrew. I share your pain that it took a while for
> > you to get this KIP approved and you want to reduce its scope; I'll
> > be happy to help you with the implementation :)
> >
> > Could you help me walk through what happens if the target broker is
> > unreachable? Is the client going to drop these metrics or is it going to
> > send it to the other brokers it is connected to? This information is
> > crucial to understand the client side impact on leadership failovers.
> > Moreover, in case of partial outages, such as only the network between
> the
> > client and the broker is partitioned whereas the network within the
> cluster
> > is healthy, practically there is no other way than the client side
> metrics
> > to identify this problem.
> >
> > Doguscan
> >
> > On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Doguscan,
> >> Thanks for your comments. I’m glad to hear you’re interested in this
> KIP.
> >>
> >> 1) It is preferred that a client sends its metrics to the same broker
> >> connection
> >> but actually it is able to send them to any broker. As a result, if a
> >> broker becomes
> >> unhealthy, the 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-08-11 Thread Andrew Schofield
Hi Doguscan,
Thanks for your question.

If the target broker is unreachable, the client can send the metrics to another
broker. It can select any of the other brokers for this purpose. What I expect 
in
practice is that it loses connection to the broker it’s been using for metrics,
chooses or establishes a connection to another broker, and then selects that
broker for subsequent metrics pushes.
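That selection behaviour can be sketched as follows (a hypothetical helper, not part of the KIP):

```python
from typing import Optional

class TelemetryBrokerSelector:
    """Prefers the broker currently used for metrics pushes and falls
    back to any other connected broker when the connection is lost."""

    def __init__(self) -> None:
        self.current: Optional[str] = None

    def select(self, connected_brokers: list[str]) -> Optional[str]:
        # Stick with the current broker while the connection survives.
        if self.current is not None and self.current in connected_brokers:
            return self.current
        # Otherwise choose/establish another broker for subsequent pushes.
        self.current = connected_brokers[0] if connected_brokers else None
        return self.current
```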

Thanks,
Andrew

> On 8 Aug 2023, at 08:34, Doğuşcan Namal  wrote:
>
> Thanks for your answers Andrew. I share your pain that it took a while for
> you to get this KIP approved and you want to reduce the scope of it, will
> be happy to help you with the implementation :)
>
> Could you help me walk through what happens if the target broker is
> unreachable? Is the client going to drop these metrics or is it going to
> send it to the other brokers it is connected to? This information is
> crucial to understand the client side impact on leadership failovers.
> Moreover, in case of partial outages, such as only the network between the
> client and the broker is partitioned whereas the network within the cluster
> is healthy, practically there is no other way than the client side metrics
> to identify this problem.
>
> Doguscan
>
> On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Doguscan,
>> Thanks for your comments. I’m glad to hear you’re interested in this KIP.
>>
>> 1) It is preferred that a client sends its metrics to the same broker
>> connection
>> but actually it is able to send them to any broker. As a result, if a
>> broker becomes
>> unhealthy, the client can push its metrics to any other broker. It seems
>> to me that
>> pushing to KRaft controllers instead just has the effect of increasing the
>> load on
>> the controllers, while still having the characteristic that an unhealthy
>> controller
>> would present inconvenience for collecting metrics.
>>
>> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard
>> request
>> throttling is not disabled. The metrics rate-limiting based on the push
>> interval is
>> not applied in this case for a single request for the combination of
>> client instance ID
>> and subscription ID.
>>
>> (I have corrected the KIP text because it erroneously said “client ID and
>> subscription ID”.)
>>
>> 3) While this is a theoretical problem, I’m not keen on adding yet more
>> configurations
>> to the broker or client. The `interval.ms` configuration on the
>> CLIENT_METRICS
>> resource could perhaps have a minimum and maximum value to prevent
>> accidental
>> misconfiguration.
>>
>> 4) One of the reasons that this KIP has taken so long to get to this stage
>> is that
>> it tried to do many things all at once. So, it’s greatly simplified
>> compared with
>> 6 months ago. I can see the value of collecting client configurations for
>> problem
>> determination, but I don’t want to make this KIP more complicated. I think
>> the
>> idea has merit as a separate follow-on KIP. I would be happy to collaborate
>> with you on this.
>>
>> 5) The default is set to 5 minutes to minimise the load on the broker for
>> situations
>> in which the administrator didn’t set an interval on a metrics
>> subscription. To
>> use an interval of 1 minute, it is only necessary to set `interval.ms` on
>> the metrics
>> subscription to 60000ms.
>>
>> 6) Uncompressed data is always supported. The KIP says:
>> "The CompressionType of NONE will not be present in the response from the
>> broker, though the broker does support uncompressed client telemetry if
>> none of the accepted compression codecs are supported by the client."
>> So in your example, the client need only use CompressionType=NONE.
>>
>> Thanks,
>> Andrew
>>
>>> On 4 Aug 2023, at 14:04, Doğuşcan Namal 
>> wrote:
>>>
>>> Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
>>> so thanks for writing this down 
>>>
>>>
>>>
>>> Couple of questions related to the design:
>>>
>>>
>>>
>>> 1. Can we investigate the option for using the Kraft controllers instead
>> of
>>> the brokers for sending metrics? The disadvantage of sending these
>> metrics
>>> directly to the brokers tightly couples metric observability to data
>> plane
>>> availability. If the broker is unhealthy then the root cause of an
>>> incident is clear; however, on partial failures it is hard to debug these
>>> incidents from the broker's perspective.
>>>
>>>
>>>
>>> 2. Ratelimiting will be disabled if the `PushTelemetryRequest.Terminating`
>>> flag is set. However, this may cause unavailability on the broker if too
>>> many clients are terminated at once, especially network threads could
>>> become busy and introduce latency on the produce/consume on other
>>> non-terminating clients' connections. I think there is room for
>>> improvement here. If the client is gracefully shutting down, it could
>> wait
>>> for the request to be handled if it is being ratelimited, it doesn't need
>>> 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-08-08 Thread Doğuşcan Namal
Thanks for your answers Andrew. I share your pain that it took a while for
you to get this KIP approved and you want to reduce its scope; I'll
be happy to help you with the implementation :)

Could you help me walk through what happens if the target broker is
unreachable? Is the client going to drop these metrics or is it going to
send it to the other brokers it is connected to? This information is
crucial to understand the client side impact on leadership failovers.
Moreover, in case of partial outages, such as only the network between the
client and the broker is partitioned whereas the network within the cluster
is healthy, practically there is no other way than the client side metrics
to identify this problem.

Doguscan

On Fri, 4 Aug 2023 at 15:33, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Doguscan,
> Thanks for your comments. I’m glad to hear you’re interested in this KIP.
>
> 1) It is preferred that a client sends its metrics to the same broker
> connection
> but actually it is able to send them to any broker. As a result, if a
> broker becomes
> unhealthy, the client can push its metrics to any other broker. It seems
> to me that
> pushing to KRaft controllers instead just has the effect of increasing the
> load on
> the controllers, while still having the characteristic that an unhealthy
> controller
> would present inconvenience for collecting metrics.
>
> 2) When the `PushTelemetryRequest.Terminating` flag is set, the standard
> request
> throttling is not disabled. The metrics rate-limiting based on the push
> interval is
> not applied in this case for a single request for the combination of
> client instance ID
> and subscription ID.
>
> (I have corrected the KIP text because it erroneously said “client ID and
> subscription ID”.)
>
> 3) While this is a theoretical problem, I’m not keen on adding yet more
> configurations
> to the broker or client. The `interval.ms` configuration on the
> CLIENT_METRICS
> resource could perhaps have a minimum and maximum value to prevent
> accidental
> misconfiguration.
>
> 4) One of the reasons that this KIP has taken so long to get to this stage
> is that
> it tried to do many things all at once. So, it’s greatly simplified
> compared with
> 6 months ago. I can see the value of collecting client configurations for
> problem
> determination, but I don’t want to make this KIP more complicated. I think
> the
> idea has merit as a separate follow-on KIP. I would be happy to collaborate
> with you on this.
>
> 5) The default is set to 5 minutes to minimise the load on the broker for
> situations
> in which the administrator didn’t set an interval on a metrics
> subscription. To
> use an interval of 1 minute, it is only necessary to set `interval.ms` on
> the metrics
> subscription to 60000ms.
>
> 6) Uncompressed data is always supported. The KIP says:
>  "The CompressionType of NONE will not be present in the response from the
> broker, though the broker does support uncompressed client telemetry if
> none of the accepted compression codecs are supported by the client."
> So in your example, the client need only use CompressionType=NONE.
>
> Thanks,
> Andrew
>
> > On 4 Aug 2023, at 14:04, Doğuşcan Namal 
> wrote:
> >
> > Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
> > so thanks for writing this down 
> >
> >
> >
> > Couple of questions related to the design:
> >
> >
> >
> > 1. Can we investigate the option for using the Kraft controllers instead
> of
> > the brokers for sending metrics? The disadvantage of sending these
> metrics
> > directly to the brokers tightly couples metric observability to data
> plane
> > availability. If the broker is unhealthy then the root cause of an
> > incident is clear; however, on partial failures it is hard to debug these
> > incidents from the broker's perspective.
> >
> >
> >
> > 2. Ratelimiting will be disabled if the `PushTelemetryRequest.Terminating`
> > flag is set. However, this may cause unavailability on the broker if too
> > many clients are terminated at once, especially network threads could
> > become busy and introduce latency on the produce/consume on other
> > non-terminating clients' connections. I think there is room for
> > improvement here. If the client is gracefully shutting down, it could
> wait
> > for the request to be handled if it is being ratelimited, it doesn't need
> > to "force push" the metrics. For that reason, maybe we could define a
> > separate ratelimiting for telemetry data?
> >
> >
> >
> > 3. `PushIntervalMs` is set on the client side by a response from
> > `GetTelemetrySubscriptionsResponse`. If the broker sets this value too
> > low, like 1 msec, this may hog all of the client's activity and cause an
> > impact on the client side. I think we should introduce a configuration
> both
> > on the client and the broker side for the minimum and maximum numbers for
> > this value to fence out misconfigurations.
> >
> >
> >
> 

RE: [DISCUSS] KIP-714: Client metrics and observability

2023-08-04 Thread Doğuşcan Namal
Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
so thanks for writing this down 



Couple of questions related to the design:



1. Can we investigate the option for using the KRaft controllers instead of
the brokers for sending metrics? The disadvantage of sending these metrics
directly to the brokers is that it tightly couples metric observability to
data plane availability. If the broker is unhealthy then the root cause of
an incident is clear; however, on partial failures it is hard to debug
these incidents from the broker's perspective.



2. Ratelimiting will be disabled if the `PushTelemetryRequest.Terminating`
flag is set. However, this may cause unavailability on the broker if too
many clients are terminated at once; in particular, network threads could
become busy and introduce latency on produce/consume for other
non-terminating clients' connections. I think there is room for
improvement here. If the client is gracefully shutting down, it could wait
for the request to be handled if it is being ratelimited; it doesn't need
to "force push" the metrics. For that reason, maybe we could define a
separate ratelimit for telemetry data?



3. `PushIntervalMs` is set on the client side by a response from
`GetTelemetrySubscriptionsResponse`. If the broker sets this value too
low, like 1 ms, this may hog all of the client's activity and cause an
impact on the client side. I think we should introduce a configuration both
on the client and the broker side for the minimum and maximum numbers for
this value to fence out misconfigurations.
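To make the fencing idea concrete, a rough sketch of client-side clamping
(the bound names and values here are illustrative, not part of the KIP):

```python
# Sketch of client-side fencing for the broker-supplied PushIntervalMs.
# These bounds are illustrative defaults, not values defined by KIP-714.
MIN_PUSH_INTERVAL_MS = 1_000        # refuse sub-second push storms
MAX_PUSH_INTERVAL_MS = 3_600_000    # refuse intervals longer than an hour

def effective_push_interval(broker_interval_ms):
    """Clamp a possibly misconfigured broker value into a sane range."""
    return max(MIN_PUSH_INTERVAL_MS,
               min(broker_interval_ms, MAX_PUSH_INTERVAL_MS))
```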



4. One of the important things I face during debugging client-side
failures is understanding the client-side configurations. Can the client
send these configs in the GetTelemetrySubscriptions request as well?



Small comments:

5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute instead? I
think 5 minutes of aggregated data is not all that helpful in the world of
telemetry

6. UnsupportedCompressionType: Shall we fall back to non-compression mode in
that case? I think compression is nice to have, but non-compressed
telemetry data is valuable as well. Especially for low-throughput clients,
compressing telemetry data may cause more CPU load than the actual data
plane work.


Thanks again.

Doguscan



> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
>  wrote:
>
> Hi,
> I would like to start a new discussion thread on KIP-714: Client metrics and
> observability.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
> I have edited the proposal significantly to reduce the scope. The overall
> mechanism for client metric subscriptions is unchanged, but the
> KIP is now based on the existing client metrics, rather than introducing new
> metrics. The purpose remains helping cluster operators
> investigate performance problems experienced by clients without requiring
> changes to the client application code or configuration.
>
> Thanks,
> Andrew


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-08-04 Thread Andrew Schofield
Hi Doguscan,
Thanks for your comments. I’m glad to hear you’re interested in this KIP.

1) It is preferred that a client sends its metrics to the same broker connection
but actually it is able to send them to any broker. As a result, if a broker 
becomes
unhealthy, the client can push its metrics to any other broker. It seems to me 
that
pushing to KRaft controllers instead just has the effect of increasing the load
on the controllers, while still having the problem that an unhealthy controller
would make collecting metrics inconvenient.

2) When the `PushTelemetryRequest.Terminating` flag is set, the standard request
throttling is not disabled. The metrics rate-limiting based on the push 
interval is
not applied in this case for a single request for the combination of client 
instance ID
and subscription ID.

(I have corrected the KIP text because it erroneously said “client ID and
subscription ID”.)

3) While this is a theoretical problem, I’m not keen on adding yet more 
configurations
to the broker or client. The `interval.ms` configuration on the CLIENT_METRICS
resource could perhaps have a minimum and maximum value to prevent accidental
misconfiguration.

4) One of the reasons that this KIP has taken so long to get to this stage is 
that
it tried to do many things all at once. So, it’s greatly simplified compared 
with
6 months ago. I can see the value of collecting client configurations for 
problem
determination, but I don’t want to make this KIP more complicated. I think the
idea has merit as a separate follow-on KIP. I would be happy to collaborate
with you on this.

5) The default is set to 5 minutes to minimise the load on the broker for 
situations
in which the administrator didn’t set an interval on a metrics subscription. To
use an interval of 1 minute, it is only necessary to set `interval.ms` on the
metrics subscription to 60000ms.

6) Uncompressed data is always supported. The KIP says:
"The CompressionType of NONE will not be present in the response from the
broker, though the broker does support uncompressed client telemetry if none
of the accepted compression codecs are supported by the client.”
So in your example, the client need only use CompressionType=NONE.
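As a sketch, the codec selection this implies is just "first mutually
supported codec, else NONE" (codec names here are illustrative):

```python
def choose_compression(accepted_by_broker, supported_by_client):
    """Pick the first broker-accepted codec the client also supports.

    NONE never appears in the broker's accepted list, but uncompressed
    telemetry is always allowed, so it is the fallback.
    """
    for codec in accepted_by_broker:
        if codec in supported_by_client:
            return codec
    return "NONE"
```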

Thanks,
Andrew

> On 4 Aug 2023, at 14:04, Doğuşcan Namal  wrote:
>
> Hi Andrew, thanks a lot for this KIP. I was thinking of something similar
> so thanks for writing this down 
>
>
>
> Couple of questions related to the design:
>
>
>
> 1. Can we investigate the option for using the KRaft controllers instead of
> the brokers for sending metrics? The disadvantage of sending these metrics
> directly to the brokers is that it tightly couples metric observability to
> data plane availability. If the broker is unhealthy then the root cause of
> an incident is clear; however, on partial failures it is hard to debug
> these incidents from the broker's perspective.
>
>
>
> 2. Ratelimiting will be disabled if the `PushTelemetryRequest.Terminating`
> flag is set. However, this may cause unavailability on the broker if too
> many clients are terminated at once; in particular, network threads could
> become busy and introduce latency on produce/consume for other
> non-terminating clients' connections. I think there is room for
> improvement here. If the client is gracefully shutting down, it could wait
> for the request to be handled if it is being ratelimited; it doesn't need
> to "force push" the metrics. For that reason, maybe we could define a
> separate ratelimit for telemetry data?
>
>
>
> 3. `PushIntervalMs` is set on the client side by a response from
> `GetTelemetrySubscriptionsResponse`. If the broker sets this value too
> low, like 1 ms, this may hog all of the client's activity and cause an
> impact on the client side. I think we should introduce a configuration both
> on the client and the broker side for the minimum and maximum numbers for
> this value to fence out misconfigurations.
>
>
>
> 4. One of the important things I face during debugging client-side
> failures is understanding the client-side configurations. Can the client
> send these configs in the GetTelemetrySubscriptions request as well?
>
>
>
> Small comments:
>
> 5. Default PushIntervalMs is 5 minutes. Can we make it 1 minute instead? I
> think 5 minutes of aggregated data is not all that helpful in the world of
> telemetry
>
> 6. UnsupportedCompressionType: Shall we fall back to non-compression mode in
> that case? I think compression is nice to have, but non-compressed
> telemetry data is valuable as well. Especially for low-throughput clients,
> compressing telemetry data may cause more CPU load than the actual data
> plane work.
>
>
> Thanks again.
>
> Doguscan
>
>
>
>> On Jun 13, 2023, at 8:06 AM, Andrew Schofield
>>  wrote:
>>
>> Hi,
>> I would like to start a new discussion thread on KIP-714: Client metrics
>> and observability.
>>

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-07-31 Thread Andrew Schofield
Hi Milind,
Thanks for your question.

On reflection, I agree that INVALID_RECORD is most likely to be caused by a
problem in the serialization in the client. I have changed the client action in 
this case
to “Log an error and stop pushing metrics”.

I have updated the KIP text accordingly.

Thanks,
Andrew

> On 31 Jul 2023, at 12:09, Milind Luthra  wrote:
>
> Hi Andrew,
> Thanks for the clarifications.
>
> About 2b:
> In case a client has a bug while serializing, it might be difficult for the
> client to recover from that without code changes. In that, it might be good
> to just log the INVALID_RECORD as an error, and treat the error as fatal
> for the client (only fatal in terms of sending the metrics, the client can
> keep functioning otherwise). What do you think?
>
> Thanks
> Milind
>
> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Milind,
>> Thanks for your questions about the KIP.
>>
>> 1) I did some archaeology and looked at historical versions of the KIP. I
>> think this is
>> just a mistake. 5 minutes is the default metric push interval. 30 minutes
>> is a mystery
>> to me. I’ve updated the KIP.
>>
>> 2) I think there are two situations in which INVALID_RECORD might occur.
>> a) The client might perhaps be using a content-type that the broker does
>> not support.
>> The KIP mentions content-type as a future extension, but there’s only one
>> supported
>> to start with. Until we have multiple content-types, this seems out of
>> scope. I think a
>> future KIP would add another error code for this.
>> b) The client might perhaps have a bug which means the metrics payload is
>> malformed.
>> Logging a warning and attempting the next metrics push on the push
>> interval seems
>> appropriate.
>>
>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
>> GetTelemetrySubscriptionsRequest.
>>
>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
>> a situation in which a broker sends a compression type in a
>> GetTelemetrySubscriptionsResponse
>> which is subsequently not supported when it's used with a
>> PushTelemetryRequest.
>> We do want the client to have the opportunity to get an up-to-date list of
>> supported
>> compression types. I think an immediate GetTelemetrySubscriptionsRequest
>> is appropriate.
>>
>> 3) If a client attempts a subsequent handshake with a Null
>> ClientInstanceId, the
>> receiving broker may not already know the client's existing
>> ClientInstanceId. If the
>> receiving broker knows the existing ClientInstanceId, it simply responds
>> the existing
>> value back to the client. If it does not know the existing
>> ClientInstanceId, it will create
>> a new client instance ID and respond with that.
>>
>> I will update the KIP with these clarifications.
>>
>> Thanks,
>> Andrew
>>
>>> On 17 Jul 2023, at 14:21, Milind Luthra 
>> wrote:
>>>
>>> Hi Andrew, thanks for this KIP.
>>>
>>> I had a few questions regarding the "Error handling" section.
>>>
>>> 1. It mentions that "The 5 and 30 minute retries are to eventually
>> trigger
>>> a retry and avoid having to restart clients if the cluster metrics
>>> configuration is disabled temporarily, e.g., by operator error, rolling
>>> upgrades, etc."
>>> But this 30 min interval isn't mentioned anywhere else. What is it
>>> referring to?
>>>
>>> 2. For the actual errors:
>>> INVALID_RECORD : The action required is to "Log a warning to the
>>> application and schedule the next GetTelemetrySubscriptionsRequest to 5
>>> minutes". Why is this 5 minutes, and not something like PushIntervalMs?
>> And
>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
>>> case, if the serialization is broken?
>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm,
>>> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
>>> after the PushTelemetry response, correct?
>>>
>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
>>> ClientInstanceId returned in the first response, regardless of broker":
>>> Will a broker error be returned in case some implementation of this KIP
>>> violates this accidentally and sends a request with ClientInstanceId =
>> Null
>>> even when it's been obtained already? Or will a new ClientInstanceId be
>>> returned without an error?
>>>
>>> Thanks!
>>>
>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>> andrew_schofield_j...@outlook.com> wrote:
>>>
 Hi,
 I would like to start a new discussion thread on KIP-714: Client metrics
 and observability.



>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability

 I have edited the proposal significantly to reduce the scope. The
>> overall
 mechanism for client metric subscriptions is unchanged, but the
 KIP is now based on the existing client metrics, rather than introducing
 new metrics. The purpose remains helping cluster 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-07-31 Thread Milind Luthra
Hi Andrew,
Thanks for the clarifications.

About 2b:
In case a client has a bug while serializing, it might be difficult for the
client to recover from that without code changes. In that, it might be good
to just log the INVALID_RECORD as an error, and treat the error as fatal
for the client (only fatal in terms of sending the metrics, the client can
keep functioning otherwise). What do you think?

Thanks
Milind

On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Milind,
> Thanks for your questions about the KIP.
>
> 1) I did some archaeology and looked at historical versions of the KIP. I
> think this is
> just a mistake. 5 minutes is the default metric push interval. 30 minutes
> is a mystery
> to me. I’ve updated the KIP.
>
> 2) I think there are two situations in which INVALID_RECORD might occur.
> a) The client might perhaps be using a content-type that the broker does
> not support.
> The KIP mentions content-type as a future extension, but there’s only one
> supported
> to start with. Until we have multiple content-types, this seems out of
> scope. I think a
> future KIP would add another error code for this.
> b) The client might perhaps have a bug which means the metrics payload is
> malformed.
> Logging a warning and attempting the next metrics push on the push
> interval seems
> appropriate.
>
> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
> GetTelemetrySubscriptionsRequest.
>
> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
> a situation in which a broker sends a compression type in a
> GetTelemetrySubscriptionsResponse
> which is subsequently not supported when it's used with a
> PushTelemetryRequest.
> We do want the client to have the opportunity to get an up-to-date list of
> supported
> compression types. I think an immediate GetTelemetrySubscriptionsRequest
> is appropriate.
>
> 3) If a client attempts a subsequent handshake with a Null
> ClientInstanceId, the
> receiving broker may not already know the client's existing
> ClientInstanceId. If the
> receiving broker knows the existing ClientInstanceId, it simply responds
> the existing
> value back to the client. If it does not know the existing
> ClientInstanceId, it will create
> a new client instance ID and respond with that.
>
> I will update the KIP with these clarifications.
>
> Thanks,
> Andrew
>
> > On 17 Jul 2023, at 14:21, Milind Luthra 
> wrote:
> >
> > Hi Andrew, thanks for this KIP.
> >
> > I had a few questions regarding the "Error handling" section.
> >
> > 1. It mentions that "The 5 and 30 minute retries are to eventually
> trigger
> > a retry and avoid having to restart clients if the cluster metrics
> > configuration is disabled temporarily, e.g., by operator error, rolling
> > upgrades, etc."
> > But this 30 min interval isn't mentioned anywhere else. What is it
> > referring to?
> >
> > 2. For the actual errors:
> > INVALID_RECORD : The action required is to "Log a warning to the
> > application and schedule the next GetTelemetrySubscriptionsRequest to 5
> > minutes". Why is this 5 minutes, and not something like PushIntervalMs?
> And
> > also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
> > case, if the serialization is broken?
> > UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm,
> > the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
> > after the PushTelemetry response, correct?
> >
> > 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
> > ClientInstanceId returned in the first response, regardless of broker":
> > Will a broker error be returned in case some implementation of this KIP
> > violates this accidentally and sends a request with ClientInstanceId =
> Null
> > even when it's been obtained already? Or will a new ClientInstanceId be
> > returned without an error?
> >
> > Thanks!
> >
> > On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi,
> >> I would like to start a new discussion thread on KIP-714: Client metrics
> >> and observability.
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>
> >> I have edited the proposal significantly to reduce the scope. The
> overall
> >> mechanism for client metric subscriptions is unchanged, but the
> >> KIP is now based on the existing client metrics, rather than introducing
> >> new metrics. The purpose remains helping cluster operators
> >> investigate performance problems experienced by clients without
> requiring
> >> changes to the client application code or configuration.
> >>
> >> Thanks,
> >> Andrew
>
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-07-24 Thread Andrew Schofield
Hi Milind,
Thanks for your questions about the KIP.

1) I did some archaeology and looked at historical versions of the KIP. I think 
this is
just a mistake. 5 minutes is the default metric push interval. 30 minutes is a 
mystery
to me. I’ve updated the KIP.

2) I think there are two situations in which INVALID_RECORD might occur.
a) The client might perhaps be using a content-type that the broker does not 
support.
The KIP mentions content-type as a future extension, but there’s only one 
supported
to start with. Until we have multiple content-types, this seems out of scope. I 
think a
future KIP would add another error code for this.
b) The client might perhaps have a bug which means the metrics payload is 
malformed.
Logging a warning and attempting the next metrics push on the push interval 
seems
appropriate.

UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
GetTelemetrySubscriptionsRequest.

UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
a situation in which a broker sends a compression type in a 
GetTelemetrySubscriptionsResponse
which is subsequently not supported when it's used with a PushTelemetryRequest.
We do want the client to have the opportunity to get an up-to-date list of 
supported
compression types. I think an immediate GetTelemetrySubscriptionsRequest is 
appropriate.
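Putting those cases together, the client-side handling reduces to a small
dispatch on the error code. A sketch of that dispatch (the action strings are
mine, not KIP wording; note that the INVALID_RECORD handling was later revised
to "log an error and stop pushing metrics"):

```python
def next_action(error_code):
    """Map a PushTelemetryResponse error to the client's next step,
    following the handling described in this message."""
    if error_code == "INVALID_RECORD":
        # Likely a malformed payload: warn and retry on the normal
        # push interval.
        return "log-warning-and-retry-next-interval"
    if error_code in ("UNKNOWN_SUBSCRIPTION_ID",
                      "UNSUPPORTED_COMPRESSION_TYPE"):
        # Stale subscription or codec list: re-handshake immediately.
        return "get-telemetry-subscriptions-now"
    return "default-error-handling"
```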

3) If a client attempts a subsequent handshake with a Null ClientInstanceId, the
receiving broker may not already know the client's existing ClientInstanceId. 
If the
receiving broker knows the existing ClientInstanceId, it simply responds the 
existing
value back to the client. If it does not know the existing ClientInstanceId, it 
will create
a new client instance ID and respond with that.
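The broker-side rule above can be sketched as follows (the `known_id`
argument is a hypothetical stand-in for whatever record the receiving
broker happens to hold; this is not broker code):

```python
import uuid

def handshake_instance_id(request_id, known_id):
    """Resolve the client instance id for a GetTelemetrySubscriptions
    handshake, per the rule described above. Both arguments may be None."""
    if request_id is not None:
        return request_id          # client already holds an id: echo it back
    if known_id is not None:
        return known_id            # broker remembers the existing id
    return str(uuid.uuid4())       # otherwise mint a fresh one
```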

I will update the KIP with these clarifications.

Thanks,
Andrew

> On 17 Jul 2023, at 14:21, Milind Luthra  wrote:
>
> Hi Andrew, thanks for this KIP.
>
> I had a few questions regarding the "Error handling" section.
>
> 1. It mentions that "The 5 and 30 minute retries are to eventually trigger
> a retry and avoid having to restart clients if the cluster metrics
> configuration is disabled temporarily, e.g., by operator error, rolling
> upgrades, etc."
> But this 30 min interval isn't mentioned anywhere else. What is it
> referring to?
>
> 2. For the actual errors:
> INVALID_RECORD : The action required is to "Log a warning to the
> application and schedule the next GetTelemetrySubscriptionsRequest to 5
> minutes". Why is this 5 minutes, and not something like PushIntervalMs? And
> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
> case, if the serialization is broken?
> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm,
> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
> after the PushTelemetry response, correct?
>
> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
> ClientInstanceId returned in the first response, regardless of broker":
> Will a broker error be returned in case some implementation of this KIP
> violates this accidentally and sends a request with ClientInstanceId = Null
> even when it's been obtained already? Or will a new ClientInstanceId be
> returned without an error?
>
> Thanks!
>
> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi,
>> I would like to start a new discussion thread on KIP-714: Client metrics
>> and observability.
>>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>
>> I have edited the proposal significantly to reduce the scope. The overall
>> mechanism for client metric subscriptions is unchanged, but the
>> KIP is now based on the existing client metrics, rather than introducing
>> new metrics. The purpose remains helping cluster operators
>> investigate performance problems experienced by clients without requiring
>> changes to the client application code or configuration.
>>
>> Thanks,
>> Andrew



Re: [DISCUSS] KIP-714: Client metrics and observability

2023-07-17 Thread Milind Luthra
Hi Andrew, thanks for this KIP.

I had a few questions regarding the "Error handling" section.

1. It mentions that "The 5 and 30 minute retries are to eventually trigger
a retry and avoid having to restart clients if the cluster metrics
configuration is disabled temporarily, e.g., by operator error, rolling
upgrades, etc."
But this 30 min interval isn't mentioned anywhere else. What is it
referring to?

2. For the actual errors:
INVALID_RECORD : The action required is to "Log a warning to the
application and schedule the next GetTelemetrySubscriptionsRequest to 5
minutes". Why is this 5 minutes, and not something like PushIntervalMs? And
also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
case, if the serialization is broken?
UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm,
the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
after the PushTelemetry response, correct?

3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
ClientInstanceId returned in the first response, regardless of broker":
Will a broker error be returned in case some implementation of this KIP
violates this accidentally and sends a request with ClientInstanceId = Null
even when it's been obtained already? Or will a new ClientInstanceId be
returned without an error?

Thanks!

On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi,
> I would like to start a new discussion thread on KIP-714: Client metrics
> and observability.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
> I have edited the proposal significantly to reduce the scope. The overall
> mechanism for client metric subscriptions is unchanged, but the
> KIP is now based on the existing client metrics, rather than introducing
> new metrics. The purpose remains helping cluster operators
> investigate performance problems experienced by clients without requiring
> changes to the client application code or configuration.
>
> Thanks,
> Andrew


Re: [DISCUSS] KIP-714: Client metrics and observability

2023-06-26 Thread Andrew Schofield
Hi Kirk,
Thanks for your comments.

1) I agree that this KIP is not aiming to create a new first-class construct of 
a unique, managed, per-client instance ID.
I’ll add some clarifying words.

2) I can see what the KIP is trying to say about the Terminating flag, but it 
doesn’t quite do it for me. Essentially,
a terminating client with an active metrics subscription can send a final 
metrics push without waiting for the
interval to elapse. However, there’s a chance that the subscription has changed 
and the PushTelemetry RPC fails
with UNKNOWN_SUBSCRIPTION_ID. Then, the client is supposed to get a new
subscription ID and presumably
send its terminating metrics with this new ID without waiting for the push 
interval to elapse. I will update the text.
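In pseudocode, that terminating-push sequence looks roughly like this (the
client object below is a hypothetical stand-in, not a real Kafka API):

```python
class _Resp:
    def __init__(self, error=None):
        self.error = error

class FakeTelemetryClient:
    """Stand-in for a client whose subscription changed before close."""
    def __init__(self):
        self.handshakes = 0
        self._stale = True

    def get_telemetry_subscriptions(self):
        self.handshakes += 1
        self._stale = False        # a new subscription id was obtained

    def push_telemetry(self, terminating):
        if self._stale:
            return _Resp("UNKNOWN_SUBSCRIPTION_ID")
        return _Resp(None)

def final_push(client):
    """Send the terminating metrics push, re-handshaking once if the
    subscription changed, without waiting for the push interval."""
    resp = client.push_telemetry(terminating=True)
    if resp.error == "UNKNOWN_SUBSCRIPTION_ID":
        client.get_telemetry_subscriptions()
        resp = client.push_telemetry(terminating=True)
    return resp
```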

3) The KIP is not explicit about the regular expression matcher for matching 
client selectors. I will change it to
call out RE2/J in line with KIP-848. This is also a user-provided, server-side 
regular expression match.
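For illustration, a server-side match of selector patterns against client
labels might look like this (Python's `re` standing in for RE2/J; the label
names are illustrative):

```python
import re

def subscription_matches(selectors, client_labels):
    """True if every selector pattern fully matches the corresponding
    client label. Missing labels are treated as empty strings."""
    return all(
        re.fullmatch(pattern, client_labels.get(label, "")) is not None
        for label, pattern in selectors.items()
    )
```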

4) I think you’re right about the inclusion of temporality in the 
GetTelemetrySubscriptions response. A client would
be expected to support both cumulative and delta, although initially the broker
will only use delta. However, it’s quite
an important part of the OTLP metrics specification. I think there is benefit 
in supporting this in KIP-714 clients
to enable temporality to be used by brokers without requiring another round of 
client version upgrades.

5) The ClientTelemetry interface gives a level of indirection between the 
MetricsReporter and the
ClientTelemetryReceiver. A MetricsReporter could implement 
ClientTelemetryReceiver directly, but
the implementation of the CTR could equally well be a separate object.

Thanks for helping to tighten up the KIP.

Andrew

> On 20 Jun 2023, at 16:47, Kirk True  wrote:
>
> Hi Andrew,
>
>
>
>> On Jun 13, 2023, at 8:06 AM, Andrew Schofield 
>>  wrote:
>>
>> Hi,
>> I would like to start a new discussion thread on KIP-714: Client metrics and 
>> observability.
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>
>> I have edited the proposal significantly to reduce the scope. The overall 
>> mechanism for client metric subscriptions is unchanged, but the
>> KIP is now based on the existing client metrics, rather than introducing new 
>> metrics. The purpose remains helping cluster operators
>> investigate performance problems experienced by clients without requiring 
>> changes to the client application code or configuration.
>>
>> Thanks,
>> Andrew
>
> Thanks for the KIP updates. A few questions:
>
> 1. The concept of a client instance ID is somewhat similar to the unique 
> producer ID that is created for transactional producers. Can we augment the 
> name or somehow clarify that this client instance ID is only for use by 
> telemetry? The pandora’s box alternative is to make the creation, management, 
> etc. of a unique, per-client instance ID a first-class construct. I assume 
> that’s something we don’t want to bite off in this KIP ;)
>
> 2. I’m having trouble understanding where this provision for the terminating 
> flag would be useful:
>
>> The Terminating flag may be reused upon the next expiry of PushIntervalMs.
>
> In the happy path, the terminating flag is set once at time of application 
> shutdown by the close() method of a client. A buggy/nefarious client may send 
> multiple push telemetry requests with the terminating flag set to skirt 
> throttling. What’s the use case where an application would want to send a 
> second request with the terminating flag set after PushIntervalMs?
>
> 3. KIP-848 introduces a new flavor of regex for topic subscriptions. Is that 
> what we plan to adopt for the regex used by the subscription match?
>
> 4. What’s the benefit of having the broker specify the delta temporality if 
> it’s (for now) always delta, besides API protocol bumping?
>
> 5. What is gained by the existence of the ClientTelemetry interface? Why not 
> let interested parties implement ClientTelemetryReceiver directly?
>
> Thanks!




Re: [DISCUSS] KIP-714: Client metrics and observability

2023-06-20 Thread Kirk True
Hi Andrew,



> On Jun 13, 2023, at 8:06 AM, Andrew Schofield 
>  wrote:
> 
> Hi,
> I would like to start a new discussion thread on KIP-714: Client metrics and 
> observability.
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> 
> I have edited the proposal significantly to reduce the scope. The overall 
> mechanism for client metric subscriptions is unchanged, but the
> KIP is now based on the existing client metrics, rather than introducing new 
> metrics. The purpose remains helping cluster operators
> investigate performance problems experienced by clients without requiring 
> changes to the client application code or configuration.
> 
> Thanks,
> Andrew

Thanks for the KIP updates. A few questions:

1. The concept of a client instance ID is somewhat similar to the unique 
producer ID that is created for transactional producers. Can we augment the 
name or somehow clarify that this client instance ID is only for use by 
telemetry? The Pandora’s box alternative is to make the creation, management,
etc. of a unique, per-client instance ID a first-class construct. I assume 
that’s something we don’t want to bite off in this KIP ;)

2. I’m having trouble understanding where this provision for the terminating 
flag would be useful:

> The Terminating flag may be reused upon the next expiry of PushIntervalMs.

In the happy path, the terminating flag is set once at time of application 
shutdown by the close() method of a client. A buggy/nefarious client may send 
multiple push telemetry requests with the terminating flag set to skirt 
throttling. What’s the use case where an application would want to send a 
second request with the terminating flag set after PushIntervalMs?

3. KIP-848 introduces a new flavor of regex for topic subscriptions. Is that 
what we plan to adopt for the regex used by the subscription match?

4. What’s the benefit of having the broker specify the delta temporality if 
it’s (for now) always delta, besides API protocol bumping?

5. What is gained by the existence of the ClientTelemetry interface? Why not 
let interested parties implement ClientTelemetryReceiver directly?

Thanks!

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-23 Thread Kirk True
Hi Jun,

On Tue, Jun 21, 2022, at 5:24 PM, Jun Rao wrote:
> Hi, Magnus, Kirk,
> 
> Thanks for the reply. A few more comments on your reply.
> 
> 100. I agree there are some benefits of having a set of standard metrics
> across all clients, but I am just wondering how practical it is, given that
> the proposal doesn't require this set like the Kafka protocol.
> 100.1 A client may not implement all or some of the standard metrics. Then,
> we won't have complete standardized names across clients.

True, a client need not implement all the metrics from the KIP. However, those 
that it does implement will use the names specified in the KIP. The rest of the 
metrics that a client doesn't implement should be considered as "reserved for 
future use."

> 100.2 The set of standard metrics needs to be common across all clients.
> For example, client.consumer.poll.latency implies that all clients
> implement a poll() interface. Is that true for all clients?
> client.producer.record.queue.bytes. Do all producers have queues? We
> probably need to make a pass of those metrics to see if they are indeed
> common across all clients.

There are certainly metrics that are not applicable for all client 
implementations. For example, some of the host-specific CPU timing metrics are 
"hard" to get on a JVM using standard Java APIs. Ultimately the client author 
must make a judgment call about whether to implement a metric. If a given 
metric from the KIP is truly non-applicable for a client, the author would 
likely omit it from the client.

Regarding the request to "make a pass" of the clients, are there any client 
implementations in particular that I should consider reviewing?

I will make an effort to look at some of the more common clients to determine 
which metrics they expose. I'm a little concerned that this could take an 
outsized amount of effort, depending on the quality of the clients' 
documentation. Researching the code base of each client to ascertain its 
exposed metrics sounds very time-consuming.

> Also, a bunch of standard metrics have type
> Histogram. Java client doesn't have good Histogram support yet. I am also
> not sure if all clients support Histogram. Should we avoid Histogram type
> in standardized metrics?

That's a good question. I can try to get a feel for the existing histogram 
support in the ecosystem clients and report back.

The KIP does specify an alternate means to report histogram data using 
time-based averages:

"For [simplicity] a client implementation may choose to provide an average 
value as [a] Gauge instead of a Histogram. These averages should be using the 
original Histogram metric name + '.avg', e.g., 'client.request.rtt.avg'."

This approach offers lower fidelity, of course, but it's hopefully more useful 
in general to have _some_ data than _no_ data?

Perhaps we should replace histograms with this simplified implementation in the 
KIP, deferring proper histogram support to a future revision?
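To make the average-as-gauge fallback concrete, here is a minimal sketch of what a client might do for, e.g., 'client.request.rtt.avg'. The class and method names are purely illustrative assumptions, not part of the KIP or any client's actual implementation:

```java
// Hypothetical sketch of the KIP's "average instead of histogram" fallback:
// keep a running sum and count per push interval, report the average under
// the histogram's name plus ".avg", then reset for the next interval.
// AvgGauge, record(), and readAndReset() are illustrative names only.
class AvgGauge {
    private double sum;
    private long count;

    // Record one observation, e.g. a request round-trip time in ms.
    public synchronized void record(double value) {
        sum += value;
        count++;
    }

    // Emit the average for the current push interval, then reset.
    public synchronized double readAndReset() {
        double avg = (count == 0) ? 0.0 : sum / count;
        sum = 0.0;
        count = 0;
        return avg;
    }

    public static void main(String[] args) {
        AvgGauge rtt = new AvgGauge(); // reported as "client.request.rtt.avg"
        rtt.record(10.0);
        rtt.record(30.0);
        System.out.println(rtt.readAndReset()); // average over the interval
    }
}
```

This trades away percentile fidelity (you cannot recover tail latency from an average), which is exactly the lower-fidelity concession described above.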

> 100.3 For a subset of metrics that are truly common across clients, it
> would be confusing for each client to maintain two sets of metrics for the
> same thing. We could document them, but that means every user of every
> client needs to remember this mapping. This is a much bigger
> inconvenience than standardizing the metric names on the server side. If we
> want to go this route, my preference is to deprecate the existing metric
> names that are covered by the standard metric names.

Ah, good point. I admit my focus is too Java-centric.

I want to make sure I understand more specifically what "the server" is in your 
point regarding 'standardizing the metric names on the server.' At some point 
there needs to be code that executes on the server that has knowledge of all 
the clients' metric names as well as a given organization's preferred metric 
names. Would this code live in the main Apache Kafka repo? Or is it in the 
organization's ClientTelemetryReceiver implementation? Or somewhere else?

How about introducing a new pluggable mechanism/interface that the broker 
invokes to determine the metric name mapping? We could provide two 
out-of-the-box implementations: 1) a default no-op mapper, and 2) a 
configuration file-based mapper that operates off something akin to a set of 
Java properties files (one mapping file for each known client). The 
implementation of the mapper is configured by the cluster administrator and, of 
course, each organization can provide their own implementation.
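To sketch the pluggable mapper idea (a proposal only — neither of these types exists in Kafka; all names below are hypothetical):

```java
import java.util.Properties;

// Hypothetical sketch of the proposed pluggable name-mapping mechanism:
// an interface the broker-side plugin could invoke, plus the two suggested
// out-of-the-box implementations. Illustrative only, not a Kafka API.
interface MetricNameMapper {
    String map(String clientMetricName);
}

// 1) Default no-op mapper: pass metric names through unchanged.
class NoOpNameMapper implements MetricNameMapper {
    public String map(String name) {
        return name;
    }
}

// 2) Configuration file-based mapper, operating off something akin to a
// Java properties file (one mapping file per known client implementation).
class PropertiesNameMapper implements MetricNameMapper {
    private final Properties mapping;

    PropertiesNameMapper(Properties mapping) {
        this.mapping = mapping;
    }

    public String map(String name) {
        // Fall back to the original name when no mapping is configured.
        return mapping.getProperty(name, name);
    }
}
```

The cluster administrator would configure which implementation the broker loads, so each organization could substitute its own.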

> 101. "or if the client-specific metrics are not converted to some common
> form, name, semantic, etc, it'll make creating meaningful aggregations and
> monitoring more complex in the upstream telemetry system with a scattered
> plethora of custom metrics." There will always be client specific metrics.
> So, it seems that we have to deal with scattered custom metrics even with a
> set of standard metrics.

Yes, this is true.

I do believe the KIP should establish a clear means to communicate about the 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-21 Thread Jun Rao
Hi, Magnus, Kirk,

Thanks for the reply. A few more comments on your reply.

100. I agree there are some benefits of having a set of standard metrics
across all clients, but I am just wondering how practical it is, given that
the proposal doesn't require this set like the Kafka protocol.
100.1 A client may not implement all or some of the standard metrics. Then,
we won't have complete standardized names across clients.
100.2 The set of standard metrics needs to be common across all clients.
For example, client.consumer.poll.latency implies that all clients
implement a poll() interface. Is that true for all clients?
client.producer.record.queue.bytes. Do all producers have queues? We
probably need to make a pass of those metrics to see if they are indeed
common across all clients. Also, a bunch of standard metrics have type
Histogram. Java client doesn't have good Histogram support yet. I am also
not sure if all clients support Histogram. Should we avoid Histogram type
in standardized metrics?
100.3 For a subset of metrics that are truly common across clients, it
would be confusing for each client to maintain two sets of metrics for the
same thing. We could document them, but that means every user of every
client needs to remember this mapping. This is a much bigger
inconvenience than standardizing the metric names on the server side. If we
want to go this route, my preference is to deprecate the existing metric
names that are covered by the standard metric names.

101. "or if the client-specific metrics are not converted to some common
form, name, semantic, etc, it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics." There will always be client specific metrics.
So, it seems that we have to deal with scattered custom metrics even with a
set of standard metrics.

102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
is changed to "connections.open.count." At this point, there are two names
> and machine-to-machine communication will likely be affected. With that
change, all client telemetry plugin(s) used in an organization must be
updated to reflect that change, else data loss or bugs could be
introduced." The standard metric names could change too in the future,
right? So, we need to deal with a similar problem if that happens.

103. "Are there any non-obvious security/privacy-related edge cases where
shipping certain metrics to the broker would be "bad?"" I am not sure. But
if a metric can be shipped to the server, it would be useful for the same
metric to be visible on the client side.

Thanks,

Jun


On Tue, Jun 21, 2022 at 8:19 AM Kirk True  wrote:

> Hi Jun,
>
> Thank you for all your continued interest in shaping the KIP :)
>
> On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > Hi, Kirk,
> >
> > Thanks for the reply. A couple of more comments.
> >
> > (1) "Another perspective is that these two sets of metrics serve
> different
> > purposes and/or have different audiences, which argues that they should
> > maintain their individuality and purpose. " Hmm, I am wondering if those
> > metrics are really for different audiences and purposes? For example, if
> > the operator detected an issue through a client metric collected through
> > the server, the operator may need to communicate that back to the client.
> > It would be weird if that same metric is not visible on the client side.
>
> I agree in principle that all client metrics visible on the client can
> also be available to be sent to the broker.
>
> Are there any non-obvious security/privacy-related edge cases where shipping
> certain metrics to the broker would be "bad?"
>
> > (2) If we could standardize the names on the server side, do we need to
> > enforce a naming convention for all clients?
>
> "Enforce" is such an ugly word :P
>
> But yes, I do feel that a consistent naming convention across all clients
> provides communication benefits between two entities:
>
>  1. Human-to-human communication. Ecosystem-wide agreement and
> understanding of metrics helps all to communicate more efficiently.
>  2. Machine-to-machine communication. Defining the names via the KIP
> mechanism helps to ensure stability across releases of a given client.
>
> Point 1: Human-to-human Communication
>
> There are quite a handful of parties that must communicate effectively
> across the Kafka ecosystem. Here are the ones I can think of off the top of
> my head:
>
>  1. Kafka client authors
>  2. Kafka client users
>  3. Kafka client telemetry plugin authors
>  4. Support teams (within an organization or vendor-supplied across
> organizations)
>  5. Kafka cluster operators
>
> There should be a standard so that these parties can understand the
> metrics' meaning and be able to correlate that across all clients.
>
> As a concrete example, KIP-714 includes a metric for tracking the number
> of active client connections to a cluster, named
> 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-21 Thread Kirk True
Hi Jun,

Thank you for all your continued interest in shaping the KIP :)

On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> Hi, Kirk,
> 
> Thanks for the reply. A couple of more comments.
> 
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.

I agree in principle that all client metrics visible on the client can also 
be available to be sent to the broker.

Are there any non-obvious security/privacy-related edge cases where shipping 
certain metrics to the broker would be "bad?"

> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?

"Enforce" is such an ugly word :P

But yes, I do feel that a consistent naming convention across all clients 
provides communication benefits on two fronts:

 1. Human-to-human communication. Ecosystem-wide agreement and understanding of 
metrics helps all to communicate more efficiently.
 2. Machine-to-machine communication. Defining the names via the KIP mechanism 
helps to ensure stability across releases of a given client.

Point 1: Human-to-human Communication

There are quite a handful of parties that must communicate effectively across 
the Kafka ecosystem. Here are the ones I can think of off the top of my head:

 1. Kafka client authors
 2. Kafka client users
 3. Kafka client telemetry plugin authors
 4. Support teams (within an organization or vendor-supplied across 
organizations)
 5. Kafka cluster operators

There should be a standard so that these parties can understand the metrics' 
meaning and be able to correlate that across all clients.

As a concrete example, KIP-714 includes a metric for tracking the number of 
active client connections to a cluster, named 
"org.apache.kafka.client.connection.active." Given this name, all client 
implementations can communicate this name and its value to all parties 
consistently. Without a standard naming convention, the metric might be named 
"connections.open" in the Java client and "Connections/Alive" in librdkafka. 
This inconsistent naming would complicate discussions among the parties 
involved.

To your point, it's absolutely a design choice to keep the naming convention 
the same between each client. We can change that if it makes sense.

Point 2: Machine-to-machine Communication

Standardization at the client level provides stability through an implied 
contract that a client should not introduce a breaking name change between 
releases. Otherwise, the ability for the metrics to be "understood" in a 
machine-to-machine context would be forfeit.

For example, let's say that we give the clients the latitude to name metrics as 
they wish. In this example, let's say that the Apache Kafka 3.4 release decides 
to name this metric "connections.open." It's a good name! It says what it is. 
However, in, let's say the Apache Kafka 3.7 release, the metric name is changed 
to "connections.open.count." At this point, there are two names and 
machine-to-machine communication will likely be affected. With that change, all 
client telemetry plugin(s) used in an organization must be updated to reflect 
that change, else data loss or bugs could be introduced.

That the KIP defines the names of the metrics does, admittedly, constrain the 
options of authors of the different clients. The metric named 
"org.apache.kafka.client.connection.active" may be confusing in some client 
implementations. For whatever reason, a client author may even find it 
"undesirable" to include a reference that includes "Apache" in their code.

There's also the precedent set by the existing (JMX-based) client metrics. 
Though these are applicable only to the Java client, we can see that having a 
standardized naming convention there has helped with communication.

So, IMO, it makes sense to define the metric names via the KIP mechanism 
and--let's say, "ask"--that client implementations abide by those.

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True  wrote:
> 
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-21 Thread Magnus Edenhill
Hey Jun and Kirk,


I see that there's a lot of focus on the existing metrics in the Java
clients, which makes sense,
but the KIP aims to approach the problem space from a higher and more
generic level by
defining:
1) a standard protocol for subscribing to, and pushing metrics,
2) an existing industry standard encoding and semantics for those metrics
(OTLP),
3) as well as a standard set of metrics that we believe are relevant to
most/all client implementations


The counter-alternative to these points, which have come up before in
various forms during the KIP discussions (see rejected alternatives) in the
KIP are:
1) use an existing out-of-band protocol,
2) use Kafka protocol encoding for the metrics,
3) let each client implementation provide their own set of metrics.

So why is the KIP not suggesting this approach? Well, in short:
 1) defies the zero-conf/always-available requirement - clients, networks,
firewalls, etc, must be specifically configured - which will not be
feasible.
 2) we would need to duplicate the work of the industry leading telemetry
people (opentelemetry) - reaping no benefits of their existing and future
work, and making integration with upstream telemetry systems harder,
 3a) these client-specific metrics would either need to be converted to
some common form - which is not only cpu/memory costly - but also hard from
an operational standpoint:
 someone, is it the kafka operator?, would need to understand what
client-specific metrics are available and what their semantics are - and
then for each such client implementation write translation code in the
broker-side plugin to try to mangle the custom metrics into a standard set
of metrics that can be monitored with a single upstream metric. With seven
or eight different client implementations in the wild, all with new
releases coming out every now and then some perhaps without per-metric
documentation, well that just seems like a daunting task that will be hard
to win.
 3b) or if the client-specific metrics are not converted to some common
form, name, semantic, etc, it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics.

Additionally, the proposed standard set of metrics are derived from what is
available in existing clients and while the fit may not be perfect to
existing metrics, they won't be too off.
More so, having a standard set of metrics to implement makes it easier for
client maintainers to know which metrics they should expose and are
considered relevant to monitoring and troubleshooting.

As for manually mapping KIP-714 metric names to JMX during troubleshooting;
I agree that is not perfect but could be solved quite easily through 
documentation, e.g., "MetricA is also known as metric.foo.a in OTLP".

Another point worth mentioning is that, while the KIP does not cover it, a
future enhancement to the clients is to also expose the OTLP metrics
directly to the application as an alternative to JMX (or whatever the
client currently exposes, e.g. JSON), which makes integration with upstream
metrics systems easier.


Thanks,
Magnus







Den tors 16 juni 2022 kl 23:38 skrev Jun Rao :

> Hi, Kirk,
>
> Thanks for the reply. A couple of more comments.
>
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.
>
> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?


> Thanks,
>
> Jun
>
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True  wrote:
>
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and
> not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > metric
> > > seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-16 Thread Jun Rao
Hi, Kirk,

Thanks for the reply. A couple of more comments.

(1) "Another perspective is that these two sets of metrics serve different
purposes and/or have different audiences, which argues that they should
maintain their individuality and purpose. " Hmm, I am wondering if those
metrics are really for different audiences and purposes? For example, if
the operator detected an issue through a client metric collected through
the server, the operator may need to communicate that back to the client.
It would be weird if that same metric is not visible on the client side.

(2) If we could standardize the names on the server side, do we need to
enforce a naming convention for all clients?

Thanks,

Jun

On Thu, Jun 16, 2022 at 12:00 PM Kirk True  wrote:

> Hi Jun,
>
> I'll try to answer the questions posed...
>
> On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > So, the standard set of generic metrics is just a recommendation and not
> a
> > requirement? This sounds good to me since it makes the adoption of the
> KIP
> > easier.
>
> I believe that was the intent, yes.
>
> > Regarding the metric names, I have two concerns.
>
> (I'm splitting these two up for readability...)
>
> > (1) If a client already
> > has an existing metric similar to the standard one, duplicating the
> metric
> > seems to be confusing.
>
> Agreed. I'm dealing with that situation as I write the Java client
> implementation.
>
> The existing Java client exposes a set of metrics via JMX. The updated
> Java client will introduce a second set of metrics, which instead are
> exposed via sending them to the broker. There is substantial overlap between
> the two sets of metrics, and in a few places in the code under development,
> there are essentially two separate calls to update metrics: one for the
> JMX-bound metrics and one for the broker-bound metrics.
>
> To be candid, I have gone back-and-forth on that design. From one
> perspective, it could be argued that the set of client metrics should be
> standardized across a given client, regardless of how those metrics are
> exposed for consumption. Another perspective is that these two sets of
> metrics serve different purposes and/or have different audiences, which
> argues that they should maintain their individuality and purpose. Your
> inputs/suggestions are certainly welcome!
>
> > (2) If a client needs to implement a standard metric
> > that doesn't exist yet, using a naming convention (e.g., using dash vs
> dot)
> > different from other existing metrics also seems a bit confusing. It
> seems
> > that the main benefit of having standard metric names across clients is
> for
> > better server side monitoring. Could we do the standardization in the
> > plugin on the server?
>
> I think the expectation is that the plugin implementation will perform
> transformation of metric names, if needed, to fit in with an organization's
> monitoring naming standards. Perhaps we need to call that out in the KIP
> itself.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill 
> wrote:
> >
> > > Hey Jun,
> > >
> > > I've clarified the scope of the standard metrics in the KIP, but
> basically:
> > >
> > >  * We define a standard set of generic metrics that should be relevant
> to
> > > most client implementations, e.g., each producer implementation
> probably
> > > has some sort of per-partition message queue.
> > >  * A client implementation should strive to implement as many of the
> > > standard metrics as possible, but only the ones that make sense.
> > >  * For metrics that are not in the standard set, a client maintainer
> can
> > > choose to either submit a KIP to add additional standard metrics - if
> > > they're relevant, or go ahead and add custom metrics that are specific
> to
> > > that client implementation. These custom metrics will have a prefix
> > > specific to that client implementation, as opposed to the standard
> metric
> > > set that resides under "org.apache.kafka...". E.g.,
> > > "se.edenhill.librdkafka" or whatever.
> > >  * Existing non-KIP-714 metrics should remain untouched. In some cases
> we
> > > might be able to use the same meter given it is compatible with the
> > > standard metric set definition, in other cases a semi-duplicate meter
> may
> > > be needed. Thus this will not affect the metrics exposed through JMX,
> or
> > > vice versa.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao :
> > >
> > > > Hi, Magnus,
> > > >
> > > > 51. Just to clarify my question.  (1) Are standard metrics required
> for
> > > > every client for this KIP to function?  (2) Are we converting
> existing
> > > java
> > > > metrics to the standard metrics and deprecating the old ones? If so,
> > > could
> > > > we list all existing java metrics that need to be renamed and the
> > > > corresponding new name?
> > > >
> > > > Thanks,
> > > >
> > > > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-16 Thread Kirk True
Hi Jun,

I'll try to answer the questions posed...

On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> Hi, Magnus,
> 
> Thanks for the reply.
> 
> So, the standard set of generic metrics is just a recommendation and not a
> requirement? This sounds good to me since it makes the adoption of the KIP
> easier.

I believe that was the intent, yes.

> Regarding the metric names, I have two concerns.

(I'm splitting these two up for readability...)

> (1) If a client already
> has an existing metric similar to the standard one, duplicating the metric
> seems to be confusing.

Agreed. I'm dealing with that situation as I write the Java client 
implementation.

The existing Java client exposes a set of metrics via JMX. The updated Java 
client will introduce a second set of metrics, which instead are exposed via 
sending them to the broker. There is substantial overlap between the two sets 
of metrics, and in a few places in the code under development, there are 
essentially two separate calls to update metrics: one for the JMX-bound metrics 
and one for the broker-bound metrics.

To be candid, I have gone back-and-forth on that design. From one perspective, 
it could be argued that the set of client metrics should be standardized across 
a given client, regardless of how those metrics are exposed for consumption. 
Another perspective is that these two sets of metrics serve different purposes 
and/or have different audiences, which argues that they should maintain their 
individuality and purpose. Your inputs/suggestions are certainly welcome! 

> (2) If a client needs to implement a standard metric
> that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
> different from other existing metrics also seems a bit confusing. It seems
> that the main benefit of having standard metric names across clients is for
> better server side monitoring. Could we do the standardization in the
> plugin on the server?

I think the expectation is that the plugin implementation will perform 
transformation of metric names, if needed, to fit in with an organization's 
monitoring naming standards. Perhaps we need to call that out in the KIP itself.
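As a rough illustration of that expectation, here is a sketch of a name-rewriting helper a broker-side telemetry plugin might use. Everything here is an assumption for illustration — the class, the target naming convention, and the `transform` helper are not part of any Kafka API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a helper a plugin implementation might use to
// rewrite standard KIP-714 metric names into an organization's own
// convention (here, a Prometheus-style snake_case naming is assumed)
// before forwarding metrics to its telemetry backend.
class OrgNameTransformer {
    // e.g. "org.apache.kafka.client.connection.active"
    //   -> "kafka_client_connection_active"
    public static String toOrgConvention(String standardName) {
        return standardName
                .replaceFirst("^org\\.apache\\.kafka\\.", "kafka_")
                .replace('.', '_');
    }

    // Rewrite every metric name in a batch, leaving values untouched.
    public static Map<String, Double> transform(Map<String, Double> metrics) {
        Map<String, Double> out = new HashMap<>();
        metrics.forEach((name, value) -> out.put(toOrgConvention(name), value));
        return out;
    }
}
```

Client-specific metrics under other prefixes (e.g. "se.edenhill.librdkafka...") would pass through with only the separator rewritten, which is one reason the KIP's prefix convention matters to plugin authors.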

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> 
> On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill  wrote:
> 
> > Hey Jun,
> >
> > I've clarified the scope of the standard metrics in the KIP, but basically:
> >
> >  * We define a standard set of generic metrics that should be relevant to
> > most client implementations, e.g., each producer implementation probably
> > has some sort of per-partition message queue.
> >  * A client implementation should strive to implement as many of the
> > standard metrics as possible, but only the ones that make sense.
> >  * For metrics that are not in the standard set, a client maintainer can
> > choose to either submit a KIP to add additional standard metrics - if
> > they're relevant, or go ahead and add custom metrics that are specific to
> > that client implementation. These custom metrics will have a prefix
> > specific to that client implementation, as opposed to the standard metric
> > set that resides under "org.apache.kafka...". E.g.,
> > "se.edenhill.librdkafka" or whatever.
> >  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> > might be able to use the same meter given it is compatible with the
> > standard metric set definition, in other cases a semi-duplicate meter may
> > be needed. Thus this will not affect the metrics exposed through JMX, or
> > vice versa.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao :
> >
> > > Hi, Magnus,
> > >
> > > 51. Just to clarify my question.  (1) Are standard metrics required for
> > > every client for this KIP to function?  (2) Are we converting existing
> > java
> > > metrics to the standard metrics and deprecating the old ones? If so,
> > could
> > > we list all existing java metrics that need to be renamed and the
> > > corresponding new name?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, May 31, 2022 at 3:29 PM Jun Rao  wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 51. I think it's fine to have a list of recommended metrics for every
> > > > client to implement. I am just not sure that standardizing on the
> > metric
> > > > names across all clients is practical. The list of common metrics in
> > the
> > > > KIP have completely different names from the java metric names. Some of
> > > > them have different types. For example, some of the common metrics
> > have a
> > > > type of histogram, but the java client metrics don't use histogram in
> > > > general. Requiring the operator to translate those names and understand
> > > the
> > > > subtle differences across clients seems to cause more confusion during
> > > > troubleshooting.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill 
> > > > wrote:
> > > >
> > > >> Den fre 20 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-07 Thread Jun Rao
Hi, Magnus,

Thanks for the reply.

So, the standard set of generic metrics is just a recommendation and not a
requirement? This sounds good to me since it makes the adoption of the KIP
easier.

Regarding the metric names, I have two concerns. (1) If a client already
has an existing metric similar to the standard one, duplicating the metric
seems to be confusing. (2) If a client needs to implement a standard metric
that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
different from other existing metrics also seems a bit confusing. It seems
that the main benefit of having standard metric names across clients is for
better server side monitoring. Could we do the standardization in the
plugin on the server?

Thanks,

Jun




Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-07 Thread Magnus Edenhill
Hey Jun,

I've clarified the scope of the standard metrics in the KIP, but basically:

 * We define a standard set of generic metrics that should be relevant to
most client implementations, e.g., each producer implementation probably
has some sort of per-partition message queue.
 * A client implementation should strive to implement as many of the
standard metrics as possible, but only the ones that make sense.
 * For metrics that are not in the standard set, a client maintainer can
choose either to submit a KIP to add additional standard metrics, if
they're relevant, or to go ahead and add custom metrics that are specific to
that client implementation. These custom metrics will have a prefix
specific to that client implementation, as opposed to the standard metric
set that resides under "org.apache.kafka...". E.g.,
"se.edenhill.librdkafka" or whatever.
 * Existing non-KIP-714 metrics should remain untouched. In some cases we
might be able to use the same meter given it is compatible with the
standard metric set definition, in other cases a semi-duplicate meter may
be needed. Thus this will not affect the metrics exposed through JMX, or
vice versa.
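To make the prefix convention concrete, here is a minimal sketch (in Python) of how a collector could separate standard-set metrics from implementation-specific ones by namespace. The fully-qualified metric names below are illustrative compositions, not names taken from the KIP:

```python
# Sketch: classify pushed metric names by namespace prefix.
# "org.apache.kafka" is the standard-set prefix described above; the
# librdkafka-style prefix is only an illustration of a custom namespace.
STANDARD_PREFIX = "org.apache.kafka"

def is_standard_metric(name: str) -> bool:
    """True if the metric name lives under the standard KIP-714 namespace."""
    return name == STANDARD_PREFIX or name.startswith(STANDARD_PREFIX + ".")

metrics = [
    "org.apache.kafka.client.producer.record.queue.bytes",  # standard set
    "se.edenhill.librdkafka.internal.wakeups",              # client-specific
]
standard = [m for m in metrics if is_standard_metric(m)]
custom = [m for m in metrics if not is_standard_metric(m)]
```

The point of the split is that dashboards can rely on the `org.apache.kafka` namespace being semantically stable across client implementations, while anything outside it is understood to be implementation-specific.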

Thanks,
Magnus



On Wed, 1 Jun 2022 at 18:55, Jun Rao wrote:

> Hi, Magnus,
>
> 51. Just to clarify my question.  (1) Are standard metrics required for
> every client for this KIP to function?  (2) Are we converting existing java
> metrics to the standard metrics and deprecating the old ones? If so, could
> we list all existing java metrics that need to be renamed and the
> corresponding new name?
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 3:29 PM Jun Rao  wrote:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 51. I think it's fine to have a list of recommended metrics for every
> > client to implement. I am just not sure that standardizing on the metric
> > names across all clients is practical. The list of common metrics in the
> > KIP have completely different names from the java metric names. Some of
> > them have different types. For example, some of the common metrics have a
> > type of histogram, but the java client metrics don't use histogram in
> > general. Requiring the operator to translate those names and understand
> the
> > subtle differences across clients seem to cause more confusion during
> > troubleshooting.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill 
> > wrote:
> >
> >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao :
> >>
> >> > Hi, Magus,
> >> >
> >> > Thanks for the reply.
> >> >
> >> > 50. Sounds good.
> >> >
> >> > 51. I miss-understood the proposal in the KIP then. The proposal is to
> >> > define a set of common metric names that every client should
> implement.
> >> The
> >> > problem is that every client already has its own set of metrics with
> its
> >> > own names. I am not sure that we could easily agree upon a common set
> of
> >> > metrics that work with all clients. There are likely to be some
> metrics
> >> > that are client specific. Translating between the common name and
> client
> >> > specific name is probably going to add more confusion. As mentioned in
> >> the
> >> > KIP, similar metrics from different clients could have subtle
> >> > semantic differences. Could we just let each client use its own set of
> >> > metric names?
> >> >
> >>
> >> We identified a common set of metrics that should be relevant for most
> >> client implementations,
> >> they're the ones listed in the KIP.
> >> A supporting client does not have to implement all those metrics, only
> the
> >> ones that makes sense
> >> based on that client implementation, and a client may implement other
> >> metrics that are not listed
> >> in the KIP under its own namespace.
> >> This approach has two benefits:
> >>  - there will be a common set of metrics that most/all clients
> implement,
> >> which makes monitoring
> >>   and troubleshooting easier across fleets with multiple Kafka client
> >> languages/implementations.
> >>  - client-specific metrics are still possible, so if there is no
> suitable
> >> standard metric a client can still
> >>provide what special metrics it has.
> >>
> >>
> >> Thanks,
> >> Magnus
> >>
> >>
> >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill 
> >> wrote:
> >> >
> >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao  >> >:
> >> > >
> >> > > > Hi, Magnus,
> >> > > >
> >> > >
> >> > > Hi Jun
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks for the updated KIP. Just a couple of more comments.
> >> > > >
> >> > > > 50. To troubleshoot a particular client issue, I imagine that the
> >> > client
> >> > > > needs to identify its client_instance_id. How does the client find
> >> this
> >> > > > out? Do we plan to include client_instance_id in the client log,
> >> expose
> >> > > it
> >> > > > as a metric or something else?
> >> > > >
> >> > >
> >> > > The KIP suggests that client implementations emit an informative log
> >> > > message
> >> > > with the assigned client-instance-id 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-01 Thread Jun Rao
Hi, Magnus,

51. Just to clarify my question.  (1) Are standard metrics required for
every client for this KIP to function?  (2) Are we converting existing java
metrics to the standard metrics and deprecating the old ones? If so, could
we list all existing java metrics that need to be renamed and the
corresponding new name?

Thanks,

Jun

On Tue, May 31, 2022 at 3:29 PM Jun Rao  wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 51. I think it's fine to have a list of recommended metrics for every
> client to implement. I am just not sure that standardizing on the metric
> names across all clients is practical. The list of common metrics in the
> KIP have completely different names from the java metric names. Some of
> them have different types. For example, some of the common metrics have a
> type of histogram, but the java client metrics don't use histogram in
> general. Requiring the operator to translate those names and understand the
> subtle differences across clients seem to cause more confusion during
> troubleshooting.
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill 
> wrote:
>
>> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao :
>>
>> > Hi, Magus,
>> >
>> > Thanks for the reply.
>> >
>> > 50. Sounds good.
>> >
>> > 51. I miss-understood the proposal in the KIP then. The proposal is to
>> > define a set of common metric names that every client should implement.
>> The
>> > problem is that every client already has its own set of metrics with its
>> > own names. I am not sure that we could easily agree upon a common set of
>> > metrics that work with all clients. There are likely to be some metrics
>> > that are client specific. Translating between the common name and client
>> > specific name is probably going to add more confusion. As mentioned in
>> the
>> > KIP, similar metrics from different clients could have subtle
>> > semantic differences. Could we just let each client use its own set of
>> > metric names?
>> >
>>
>> We identified a common set of metrics that should be relevant for most
>> client implementations,
>> they're the ones listed in the KIP.
>> A supporting client does not have to implement all those metrics, only the
>> ones that makes sense
>> based on that client implementation, and a client may implement other
>> metrics that are not listed
>> in the KIP under its own namespace.
>> This approach has two benefits:
>>  - there will be a common set of metrics that most/all clients implement,
>> which makes monitoring
>>   and troubleshooting easier across fleets with multiple Kafka client
>> languages/implementations.
>>  - client-specific metrics are still possible, so if there is no suitable
>> standard metric a client can still
>>provide what special metrics it has.
>>
>>
>> Thanks,
>> Magnus
>>
>>
>> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill 
>> wrote:
>> >
>> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao > >:
>> > >
>> > > > Hi, Magnus,
>> > > >
>> > >
>> > > Hi Jun
>> > >
>> > >
>> > > >
>> > > > Thanks for the updated KIP. Just a couple of more comments.
>> > > >
>> > > > 50. To troubleshoot a particular client issue, I imagine that the
>> > client
>> > > > needs to identify its client_instance_id. How does the client find
>> this
>> > > > out? Do we plan to include client_instance_id in the client log,
>> expose
>> > > it
>> > > > as a metric or something else?
>> > > >
>> > >
>> > > The KIP suggests that client implementations emit an informative log
>> > > message
>> > > with the assigned client-instance-id once it is retrieved (once per
>> > client
>> > > instance lifetime).
>> > > There's also a clientInstanceId() method that an application can use
>> to
>> > > retrieve
>> > > the client instance id and emit through whatever side channels makes
>> > sense.
>> > >
>> > >
>> > >
>> > > > 51. The KIP lists a bunch of metrics that need to be collected at
>> the
>> > > > client side. However, it seems quite a few useful java client
>> metrics
>> > > like
>> > > > the following are missing.
>> > > > buffer-total-bytes
>> > > > buffer-available-bytes
>> > > >
>> > >
>> > > These are covered by client.producer.record.queue.bytes and
>> > > client.producer.record.queue.max.bytes.
>> > >
>> > >
>> > > > bufferpool-wait-time
>> > > >
>> > >
>> > > Missing, but somewhat implementation specific.
>> > > If it was up to me we would add this later if there's a need.
>> > >
>> > >
>> > >
>> > > > batch-size-avg
>> > > > batch-size-max
>> > > >
>> > >
>> > > These are missing and would be suitably represented as a histogram.
>> I'll
>> > > add them.
>> > >
>> > >
>> > >
>> > > > io-wait-ratio
>> > > > io-ratio
>> > > >
>> > >
>> > > There's client.io.wait.time which should cover io-wait-ratio.
>> > > We could add a client.io.time as well, now or in a later KIP.
>> > >
>> > > Thanks,
>> > > Magnus
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jun
>> > > >
>> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-05-31 Thread Jun Rao
Hi, Magnus,

Thanks for the reply.

51. I think it's fine to have a list of recommended metrics for every
client to implement. I am just not sure that standardizing on the metric
names across all clients is practical. The list of common metrics in the
KIP has completely different names from the java metric names. Some of
them have different types. For example, some of the common metrics have a
type of histogram, but the java client metrics don't use histogram in
general. Requiring the operator to translate those names and understand the
subtle differences across clients seems to cause more confusion during
troubleshooting.
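To make the translation concern concrete, here is a sketch of the kind of lookup table an operator-side tool might maintain. The pairs come from correspondences Magnus gives elsewhere in this thread (buffer-total-bytes/buffer-available-bytes being "covered by" the record-queue metrics, and io-wait-ratio by client.io.wait.time); the exact one-to-one pairing is an assumption for illustration, not part of the KIP:

```python
# Java client metric name -> KIP-714 standard metric name.
# Illustrative only: the thread pairs these groups without fixing an
# exact one-to-one mapping, and semantics may differ subtly.
JAVA_TO_STANDARD = {
    "buffer-total-bytes": "client.producer.record.queue.bytes",
    "buffer-available-bytes": "client.producer.record.queue.max.bytes",
    "io-wait-ratio": "client.io.wait.time",
}

def to_standard(java_name: str) -> str:
    """Translate a Java client metric name, falling back to the original."""
    return JAVA_TO_STANDARD.get(java_name, java_name)
```

A table like this is exactly the per-client bookkeeping Jun is worried about: every client implementation would need its own copy, each with its own subtle caveats.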

Thanks,

Jun

On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill  wrote:

> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao :
>
> > Hi, Magus,
> >
> > Thanks for the reply.
> >
> > 50. Sounds good.
> >
> > 51. I miss-understood the proposal in the KIP then. The proposal is to
> > define a set of common metric names that every client should implement.
> The
> > problem is that every client already has its own set of metrics with its
> > own names. I am not sure that we could easily agree upon a common set of
> > metrics that work with all clients. There are likely to be some metrics
> > that are client specific. Translating between the common name and client
> > specific name is probably going to add more confusion. As mentioned in
> the
> > KIP, similar metrics from different clients could have subtle
> > semantic differences. Could we just let each client use its own set of
> > metric names?
> >
>
> We identified a common set of metrics that should be relevant for most
> client implementations,
> they're the ones listed in the KIP.
> A supporting client does not have to implement all those metrics, only the
> ones that makes sense
> based on that client implementation, and a client may implement other
> metrics that are not listed
> in the KIP under its own namespace.
> This approach has two benefits:
>  - there will be a common set of metrics that most/all clients implement,
> which makes monitoring
>   and troubleshooting easier across fleets with multiple Kafka client
> languages/implementations.
>  - client-specific metrics are still possible, so if there is no suitable
> standard metric a client can still
>provide what special metrics it has.
>
>
> Thanks,
> Magnus
>
>
> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill 
> wrote:
> >
> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao :
> > >
> > > > Hi, Magnus,
> > > >
> > >
> > > Hi Jun
> > >
> > >
> > > >
> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > >
> > > > 50. To troubleshoot a particular client issue, I imagine that the
> > client
> > > > needs to identify its client_instance_id. How does the client find
> this
> > > > out? Do we plan to include client_instance_id in the client log,
> expose
> > > it
> > > > as a metric or something else?
> > > >
> > >
> > > The KIP suggests that client implementations emit an informative log
> > > message
> > > with the assigned client-instance-id once it is retrieved (once per
> > client
> > > instance lifetime).
> > > There's also a clientInstanceId() method that an application can use to
> > > retrieve
> > > the client instance id and emit through whatever side channels makes
> > sense.
> > >
> > >
> > >
> > > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > > client side. However, it seems quite a few useful java client metrics
> > > like
> > > > the following are missing.
> > > > buffer-total-bytes
> > > > buffer-available-bytes
> > > >
> > >
> > > These are covered by client.producer.record.queue.bytes and
> > > client.producer.record.queue.max.bytes.
> > >
> > >
> > > > bufferpool-wait-time
> > > >
> > >
> > > Missing, but somewhat implementation specific.
> > > If it was up to me we would add this later if there's a need.
> > >
> > >
> > >
> > > > batch-size-avg
> > > > batch-size-max
> > > >
> > >
> > > These are missing and would be suitably represented as a histogram.
> I'll
> > > add them.
> > >
> > >
> > >
> > > > io-wait-ratio
> > > > io-ratio
> > > >
> > >
> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > We could add a client.io.time as well, now or in a later KIP.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao  wrote:
> > > >
> > > > > Hi, Xavier,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 28. It does seem that we have started using KafkaMetrics on the
> > broker
> > > > > side. Then, my only concern is on the usage of Histogram in
> > > KafkaMetrics.
> > > > > Histogram in KafkaMetrics statically divides the value space into a
> > > fixed
> > > > > number of buckets and only returns values on the bucket boundary.
> So,
> > > the
> > > > > returned histogram value may never show up in a recorded value.
> > Yammer
> > > > > Histogram, on the 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-05-31 Thread Magnus Edenhill
On Fri, 20 May 2022 at 01:23, Jun Rao wrote:

> Hi, Magus,
>
> Thanks for the reply.
>
> 50. Sounds good.
>
> 51. I miss-understood the proposal in the KIP then. The proposal is to
> define a set of common metric names that every client should implement. The
> problem is that every client already has its own set of metrics with its
> own names. I am not sure that we could easily agree upon a common set of
> metrics that work with all clients. There are likely to be some metrics
> that are client specific. Translating between the common name and client
> specific name is probably going to add more confusion. As mentioned in the
> KIP, similar metrics from different clients could have subtle
> semantic differences. Could we just let each client use its own set of
> metric names?
>

We identified a common set of metrics that should be relevant for most
client implementations; they're the ones listed in the KIP.
A supporting client does not have to implement all those metrics, only the
ones that make sense for that client implementation, and a client may
implement other metrics that are not listed in the KIP under its own
namespace.
This approach has two benefits:
 - there will be a common set of metrics that most/all clients implement,
   which makes monitoring and troubleshooting easier across fleets with
   multiple Kafka client languages/implementations.
 - client-specific metrics are still possible, so if there is no suitable
   standard metric a client can still provide what special metrics it has.


Thanks,
Magnus


On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill  wrote:
>
> > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao :
> >
> > > Hi, Magnus,
> > >
> >
> > Hi Jun
> >
> >
> > >
> > > Thanks for the updated KIP. Just a couple of more comments.
> > >
> > > 50. To troubleshoot a particular client issue, I imagine that the
> client
> > > needs to identify its client_instance_id. How does the client find this
> > > out? Do we plan to include client_instance_id in the client log, expose
> > it
> > > as a metric or something else?
> > >
> >
> > The KIP suggests that client implementations emit an informative log
> > message
> > with the assigned client-instance-id once it is retrieved (once per
> client
> > instance lifetime).
> > There's also a clientInstanceId() method that an application can use to
> > retrieve
> > the client instance id and emit through whatever side channels makes
> sense.
> >
> >
> >
> > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > client side. However, it seems quite a few useful java client metrics
> > like
> > > the following are missing.
> > > buffer-total-bytes
> > > buffer-available-bytes
> > >
> >
> > These are covered by client.producer.record.queue.bytes and
> > client.producer.record.queue.max.bytes.
> >
> >
> > > bufferpool-wait-time
> > >
> >
> > Missing, but somewhat implementation specific.
> > If it was up to me we would add this later if there's a need.
> >
> >
> >
> > > batch-size-avg
> > > batch-size-max
> > >
> >
> > These are missing and would be suitably represented as a histogram. I'll
> > add them.
> >
> >
> >
> > > io-wait-ratio
> > > io-ratio
> > >
> >
> > There's client.io.wait.time which should cover io-wait-ratio.
> > We could add a client.io.time as well, now or in a later KIP.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao  wrote:
> > >
> > > > Hi, Xavier,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 28. It does seem that we have started using KafkaMetrics on the
> broker
> > > > side. Then, my only concern is on the usage of Histogram in
> > KafkaMetrics.
> > > > Histogram in KafkaMetrics statically divides the value space into a
> > fixed
> > > > number of buckets and only returns values on the bucket boundary. So,
> > the
> > > > returned histogram value may never show up in a recorded value.
> Yammer
> > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > value
> > > > is always one of the recorded values. So, I am not sure that
> Histogram
> > in
> > > > KafkaMetrics is as good as Yammer Histogram.
> > > ClientMetricsPluginExportTime
> > > > uses Histogram.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > 
> > > > wrote:
> > > >
> > > >> >
> > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> metrics
> > > >> that
> > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > > metric.
> > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > meter
> > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> >
> > > >>
> > > >> I don't see a good reason we should limit ourselves to Yammer
> metrics
> > on
> > > >> the broker. KafkaMetrics was written
> > > >> to replace Yammer metrics and is used for all new components
> (clients,
> > > >> 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-05-19 Thread Jun Rao
Hi, Magnus,

Thanks for the reply.

50. Sounds good.

51. I misunderstood the proposal in the KIP then. The proposal is to
define a set of common metric names that every client should implement. The
problem is that every client already has its own set of metrics with its
own names. I am not sure that we could easily agree upon a common set of
metrics that work with all clients. There are likely to be some metrics
that are client specific. Translating between the common name and client
specific name is probably going to add more confusion. As mentioned in the
KIP, similar metrics from different clients could have subtle
semantic differences. Could we just let each client use its own set of
metric names?

Thanks,

Jun

On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill  wrote:

> Den ons 18 maj 2022 kl 19:57 skrev Jun Rao :
>
> > Hi, Magnus,
> >
>
> Hi Jun
>
>
> >
> > Thanks for the updated KIP. Just a couple of more comments.
> >
> > 50. To troubleshoot a particular client issue, I imagine that the client
> > needs to identify its client_instance_id. How does the client find this
> > out? Do we plan to include client_instance_id in the client log, expose
> it
> > as a metric or something else?
> >
>
> The KIP suggests that client implementations emit an informative log
> message
> with the assigned client-instance-id once it is retrieved (once per client
> instance lifetime).
> There's also a clientInstanceId() method that an application can use to
> retrieve
> the client instance id and emit through whatever side channels makes sense.
>
>
>
> > 51. The KIP lists a bunch of metrics that need to be collected at the
> > client side. However, it seems quite a few useful java client metrics
> like
> > the following are missing.
> > buffer-total-bytes
> > buffer-available-bytes
> >
>
> These are covered by client.producer.record.queue.bytes and
> client.producer.record.queue.max.bytes.
>
>
> > bufferpool-wait-time
> >
>
> Missing, but somewhat implementation specific.
> If it was up to me we would add this later if there's a need.
>
>
>
> > batch-size-avg
> > batch-size-max
> >
>
> These are missing and would be suitably represented as a histogram. I'll
> add them.
>
>
>
> > io-wait-ratio
> > io-ratio
> >
>
> There's client.io.wait.time which should cover io-wait-ratio.
> We could add a client.io.time as well, now or in a later KIP.
>
> Thanks,
> Magnus
>
>
>
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao  wrote:
> >
> > > Hi, Xavier,
> > >
> > > Thanks for the reply.
> > >
> > > 28. It does seem that we have started using KafkaMetrics on the broker
> > > side. Then, my only concern is on the usage of Histogram in
> KafkaMetrics.
> > > Histogram in KafkaMetrics statically divides the value space into a
> fixed
> > > number of buckets and only returns values on the bucket boundary. So,
> the
> > > returned histogram value may never show up in a recorded value. Yammer
> > > Histogram, on the other hand, uses reservoir sampling. The reported
> value
> > > is always one of the recorded values. So, I am not sure that Histogram
> in
> > > KafkaMetrics is as good as Yammer Histogram.
> > ClientMetricsPluginExportTime
> > > uses Histogram.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > 
> > > wrote:
> > >
> > >> >
> > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> > >> that
> > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > metric.
> > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> meter
> > >> > calculates a rate, but also exposes an accumulated value.
> > >> >
> > >>
> > >> I don't see a good reason we should limit ourselves to Yammer metrics
> on
> > >> the broker. KafkaMetrics was written
> > >> to replace Yammer metrics and is used for all new components (clients,
> > >> streams, connect, etc.)
> > >> My understanding is that the original goal was to retire Yammer
> metrics
> > in
> > >> the broker in favor of KafkaMetrics.
> > >> We just haven't done so out of backwards compatibility concerns.
> > >> There are other broker metrics such as group coordinator, transaction
> > >> state
> > >> manager, and various socket server metrics
> > >> already using KafkaMetrics that don't need specific Kafka metric
> > features,
> > >> so I don't see why we should refrain from using
> > >> Kafka metrics on the broker unless there are real compatibility
> concerns
> > >> or
> > >> where implementation specifics could lead to confusion when comparing
> > >> metrics using different implementations.
> > >>
> > >> In my opinion we should encourage people to use KafkaMetrics going
> > forward
> > >> on the broker as well, for two reasons:
> > >> a) yammer metrics is long deprecated and no longer maintained
> > >> b) yammer metrics are much less expressive
> > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > >> 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-05-19 Thread Magnus Edenhill
On Wed, 18 May 2022 at 19:57, Jun Rao wrote:

> Hi, Magnus,
>

Hi Jun


>
> Thanks for the updated KIP. Just a couple of more comments.
>
> 50. To troubleshoot a particular client issue, I imagine that the client
> needs to identify its client_instance_id. How does the client find this
> out? Do we plan to include client_instance_id in the client log, expose it
> as a metric or something else?
>

The KIP suggests that client implementations emit an informative log message
with the assigned client-instance-id once it is retrieved (once per client
instance lifetime).
There's also a clientInstanceId() method that an application can use to
retrieve the client instance id and emit it through whatever side channels
make sense.
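The once-per-lifetime logging behaviour described above can be sketched as follows. The stub class and its client_instance_id() method are stand-ins for the KIP's clientInstanceId() API, not the real Java client:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("kafka.client")

class StubClient:
    """Stand-in for a Kafka client illustrating the KIP-714 pattern:
    log the broker-assigned id once per client instance lifetime."""

    def __init__(self, assigned_id: str):
        self._assigned_id = assigned_id
        self._logged = False

    def client_instance_id(self) -> str:
        # Emit the informative log line only on first retrieval.
        if not self._logged:
            log.info("client instance id assigned: %s", self._assigned_id)
            self._logged = True
        return self._assigned_id

client = StubClient("00000000-0000-0000-0000-000000000001")
cid = client.client_instance_id()   # logs on the first call
cid2 = client.client_instance_id()  # no duplicate log line
```

The application can then forward the returned id to whatever side channel it uses for correlation (ticketing systems, structured logs, etc.).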



> 51. The KIP lists a bunch of metrics that need to be collected at the
> client side. However, it seems quite a few useful java client metrics like
> the following are missing.
> buffer-total-bytes
> buffer-available-bytes
>

These are covered by client.producer.record.queue.bytes and
client.producer.record.queue.max.bytes.


> bufferpool-wait-time
>

Missing, but somewhat implementation-specific.
If it were up to me, we would add this later if there's a need.



> batch-size-avg
> batch-size-max
>

These are missing and would be suitably represented as a histogram. I'll
add them.



> io-wait-ratio
> io-ratio
>

There's client.io.wait.time which should cover io-wait-ratio.
We could add a client.io.time as well, now or in a later KIP.
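For illustration, a ratio metric can be recovered on the collector side from two samples of a cumulative time metric; this sketch assumes client.io.wait.time is exposed as accumulated nanoseconds, which is an assumption about its shape, not something the KIP text here specifies:

```python
def ratio_from_cumulative(prev_ns: int, curr_ns: int, interval_ns: int) -> float:
    """Derive an io-wait-ratio-style value from two scrapes of a
    cumulative time metric (sketch; assumes monotonic nanosecond totals)."""
    return (curr_ns - prev_ns) / interval_ns

# 300 ms of accumulated wait over a 1 s scrape interval -> ratio of 0.3
r = ratio_from_cumulative(1_000_000_000, 1_300_000_000, 1_000_000_000)
```

This is why a cumulative time metric can "cover" a ratio metric: the ratio is derivable downstream, while the reverse is not true.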

Thanks,
Magnus




>
> Thanks,
>
> Jun
>
> On Mon, Apr 4, 2022 at 10:01 AM Jun Rao  wrote:
>
> > Hi, Xavier,
> >
> > Thanks for the reply.
> >
> > 28. It does seem that we have started using KafkaMetrics on the broker
> > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > Histogram in KafkaMetrics statically divides the value space into a fixed
> > number of buckets and only returns values on the bucket boundary. So, the
> > returned histogram value may never show up in a recorded value. Yammer
> > Histogram, on the other hand, uses reservoir sampling. The reported value
> > is always one of the recorded values. So, I am not sure that Histogram in
> > KafkaMetrics is as good as Yammer Histogram.
> ClientMetricsPluginExportTime
> > uses Histogram.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> 
> > wrote:
> >
> >> >
> >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> >> that
> >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> metric.
> >> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> >> > calculates a rate, but also exposes an accumulated value.
> >> >
> >>
> >> I don't see a good reason we should limit ourselves to Yammer metrics on
> >> the broker. KafkaMetrics was written
> >> to replace Yammer metrics and is used for all new components (clients,
> >> streams, connect, etc.)
> >> My understanding is that the original goal was to retire Yammer metrics
> in
> >> the broker in favor of KafkaMetrics.
> >> We just haven't done so out of backwards compatibility concerns.
> >> There are other broker metrics such as group coordinator, transaction
> >> state
> >> manager, and various socket server metrics
> >> already using KafkaMetrics that don't need specific Kafka metric
> features,
> >> so I don't see why we should refrain from using
> >> Kafka metrics on the broker unless there are real compatibility concerns
> >> or
> >> where implementation specifics could lead to confusion when comparing
> >> metrics using different implementations.
> >>
> >> In my opinion we should encourage people to use KafkaMetrics going
> forward
> >> on the broker as well, for two reasons:
> >> a) yammer metrics is long deprecated and no longer maintained
> >> b) yammer metrics are much less expressive
> >> c) we don't have a proper API to expose yammer metrics outside of JMX
> >> (MetricsReporter only exposes KafkaMetrics)
> >>
> >
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-05-18 Thread Jun Rao
Hi, Magnus,

Thanks for the updated KIP. Just a couple of more comments.

50. To troubleshoot a particular client issue, I imagine that the client
needs to identify its client_instance_id. How does the client find this
out? Do we plan to include client_instance_id in the client log, expose it
as a metric or something else?

51. The KIP lists a bunch of metrics that need to be collected at the
client side. However, it seems quite a few useful java client metrics like
the following are missing.
buffer-total-bytes
buffer-available-bytes
bufferpool-wait-time
batch-size-avg
batch-size-max
io-wait-ratio
io-ratio

Thanks,

Jun

On Mon, Apr 4, 2022 at 10:01 AM Jun Rao  wrote:

> Hi, Xavier,
>
> Thanks for the reply.
>
> 28. It does seem that we have started using KafkaMetrics on the broker
> side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> Histogram in KafkaMetrics statically divides the value space into a fixed
> number of buckets and only returns values on the bucket boundary. So, the
> returned histogram value may never show up in a recorded value. Yammer
> Histogram, on the other hand, uses reservoir sampling. The reported value
> is always one of the recorded values. So, I am not sure that Histogram in
> KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> uses Histogram.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté 
> wrote:
>
>> >
>> > 28. On the broker, we typically use Yammer metrics. Only for metrics
>> that
>> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
>> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
>> > calculates a rate, but also exposes an accumulated value.
>> >
>>
>> I don't see a good reason we should limit ourselves to Yammer metrics on
>> the broker. KafkaMetrics was written
>> to replace Yammer metrics and is used for all new components (clients,
>> streams, connect, etc.)
>> My understanding is that the original goal was to retire Yammer metrics in
>> the broker in favor of KafkaMetrics.
>> We just haven't done so out of backwards compatibility concerns.
>> There are other broker metrics such as group coordinator, transaction
>> state
>> manager, and various socket server metrics
>> already using KafkaMetrics that don't need specific Kafka metric features,
>> so I don't see why we should refrain from using
>> Kafka metrics on the broker unless there are real compatibility concerns
>> or
>> where implementation specifics could lead to confusion when comparing
>> metrics using different implementations.
>>
>> In my opinion we should encourage people to use KafkaMetrics going forward
>> on the broker as well, for three reasons:
>> a) Yammer metrics are long deprecated and no longer maintained
>> b) Yammer metrics are much less expressive
>> c) we don't have a proper API to expose Yammer metrics outside of JMX
>> (MetricsReporter only exposes KafkaMetrics)
>>
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-04-04 Thread Jun Rao
Hi, Xavier,

Thanks for the reply.

28. It does seem that we have started using KafkaMetrics on the broker
side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
Histogram in KafkaMetrics statically divides the value space into a fixed
number of buckets and only returns values on the bucket boundary. So, the
returned histogram value may never show up in a recorded value. Yammer
Histogram, on the other hand, uses reservoir sampling. The reported value
is always one of the recorded values. So, I am not sure that Histogram in
KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
uses Histogram.
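To make the difference concrete, here is a small Python sketch of the two approaches (illustrative only — not the actual KafkaMetrics or Yammer implementations, and all helper names are made up):

```python
import random

def fixed_bucket_boundaries(lo, hi, n):
    # KafkaMetrics-style: statically divide the value space [lo, hi]
    # into n equal-width buckets up front.
    width = (hi - lo) / n
    return [lo + i * width for i in range(n + 1)]

def record_fixed(value, boundaries):
    # The histogram only remembers bucket boundaries, so the reported
    # value is snapped to a boundary and may never equal any value
    # that was actually recorded.
    for b in boundaries:
        if value <= b:
            return b
    return boundaries[-1]

def reservoir_sample(values, k, rng=None):
    # Yammer-style reservoir sampling: the retained sample is always a
    # subset of the recorded values themselves, so reported quantiles
    # are real observations.
    rng = rng or random.Random(0)
    sample = []
    for i, v in enumerate(values):
        if i < k:
            sample.append(v)
        elif (j := rng.randint(0, i)) < k:
            sample[j] = v
    return sample
```

For example, `record_fixed(3.7, fixed_bucket_boundaries(0, 10, 10))` reports 4.0 even though 4.0 was never observed, whereas the reservoir can only ever report observed values.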

Thanks,

Jun

On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté 
wrote:

> >
> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > calculates a rate, but also exposes an accumulated value.
> >
>
> I don't see a good reason we should limit ourselves to Yammer metrics on
> the broker. KafkaMetrics was written
> to replace Yammer metrics and is used for all new components (clients,
> streams, connect, etc.)
> My understanding is that the original goal was to retire Yammer metrics in
> the broker in favor of KafkaMetrics.
> We just haven't done so out of backwards compatibility concerns.
> There are other broker metrics such as group coordinator, transaction state
> manager, and various socket server metrics
> already using KafkaMetrics that don't need specific Kafka metric features,
> so I don't see why we should refrain from using
> Kafka metrics on the broker unless there are real compatibility concerns or
> where implementation specifics could lead to confusion when comparing
> metrics using different implementations.
>
> In my opinion we should encourage people to use KafkaMetrics going forward
> on the broker as well, for three reasons:
> a) Yammer metrics are long deprecated and no longer maintained
> b) Yammer metrics are much less expressive
> c) we don't have a proper API to expose Yammer metrics outside of JMX
> (MetricsReporter only exposes KafkaMetrics)
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-31 Thread Xavier Léauté
>
> Are there cases where the metrics plugin developers would want to forward
> the compressed payload without decompressing?


The only interoperable use-case I can think of would be to forward the
payloads directly to an OpenTelemetry collector backend.
Today OTLP only mandates gzip/none compression support for gRPC and HTTP
protocols, so this might only work for a limited set
of compression formats (or no compression) out of the box.

see
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#protocol-details

Maybe we could consider exposing the raw uncompressed bytes regardless of
client side compression, if someone wanted
to avoid the cost of de-serializing the payload, since there would always
be an option to forward that as-is, and let the opentelemetry collector add
tags relevant to the broker originating those client metrics.


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-31 Thread Xavier Léauté
>
> 28. On the broker, we typically use Yammer metrics. Only for metrics that
> depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> calculates a rate, but also exposes an accumulated value.
>

I don't see a good reason we should limit ourselves to Yammer metrics on
the broker. KafkaMetrics was written
to replace Yammer metrics and is used for all new components (clients,
streams, connect, etc.)
My understanding is that the original goal was to retire Yammer metrics in
the broker in favor of KafkaMetrics.
We just haven't done so out of backwards compatibility concerns.
There are other broker metrics such as group coordinator, transaction state
manager, and various socket server metrics
already using KafkaMetrics that don't need specific Kafka metric features,
so I don't see why we should refrain from using
Kafka metrics on the broker unless there are real compatibility concerns or
where implementation specifics could lead to confusion when comparing
metrics using different implementations.

In my opinion we should encourage people to use KafkaMetrics going forward
on the broker as well, for three reasons:
a) Yammer metrics are long deprecated and no longer maintained
b) Yammer metrics are much less expressive
c) we don't have a proper API to expose Yammer metrics outside of JMX
(MetricsReporter only exposes KafkaMetrics)


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-31 Thread Magnus Edenhill
Hey Ismael,


> > The PushTelemetryRequest handler decompresses the payload before passing
> it
> > to the metrics plugin.
> > This was done to avoid having to expose a public decompression interface
> to
> > metrics plugin developers.
> >
>
> Are there cases where the metrics plugin developers would want to forward
> the compressed payload without decompressing?
>

Maybe, but most plugins probably want to either add some extra information
(e.g., from the auth context), or convert to another format, so the original
compressed blob is most likely not that interesting.
In any case the plugin will want to inspect the uncompressed metrics data to
verify it is not garbage before forwarding it upstream.

We could always add an option later to allow passing the metrics payload
verbatim if the need arises.

Thanks,
Magnus


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-30 Thread Ismael Juma
On Wed, Mar 30, 2022 at 4:08 AM Magnus Edenhill  wrote:

> > 41. We include CompressionType in PushTelemetryRequestV0, but not in
> > ClientTelemetryPayload. How would the implementer know the compression
> type
> > for the telemetry payload?
> The PushTelemetryRequest handler decompresses the payload before passing it
> to the metrics plugin.
> This was done to avoid having to expose a public decompression interface to
> metrics plugin developers.
>

Are there cases where the metrics plugin developers would want to forward
the compressed payload without decompressing?

Ismael


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-30 Thread Magnus Edenhill
Hey Jun,

see response inline:

On Mon, 21 Mar 2022 at 19:31, Jun Rao wrote:

> Hi, Kirk, Sarat,
>
> A few more comments.
>
> 40. GetTelemetrySubscriptionsResponseV0 : RequestedMetrics Array[string]
> uses "Array[0] empty string" to represent all metrics subscribed. We had a
> similar issue with the topics field in MetadataRequest and used the
> following convention.
> In version 1 and higher, an empty array indicates "request metadata for no
> topics," and a null array is used to indicate "request metadata for all
> topics."
> Should we use the same convention in GetTelemetrySubscriptionsResponseV0?
>

Right, I considered this but chose the current design because the
subscriptions are prefix-matched,
so an empty string will automatically match everything.
It is not critical in any way, so if you feel it is better to follow the
way MetadataRequest does it, I can change it?
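For illustration, the prefix-matching rule can be sketched in a few lines of Python (the helper name is hypothetical, not KIP code):

```python
def is_subscribed(metric_name, subscribed_prefixes):
    # A metric is subscribed if any configured prefix matches it.
    # str.startswith("") is always True, so an empty-string prefix
    # automatically matches every metric, as described above.
    return any(metric_name.startswith(p) for p in subscribed_prefixes)
```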



>
> 41. We include CompressionType in PushTelemetryRequestV0, but not in
> ClientTelemetryPayload. How would the implementer know the compression type
> for the telemetry payload?
>
>
The PushTelemetryRequest handler decompresses the payload before passing it
to the metrics plugin.
This was done to avoid having to expose a public decompression interface to
metrics plugin developers.



> 42. For blocking the metrics for certain clients in the following example,
> could you describe the corresponding config value used through the
> kafka-config command?
> kafka-client-metrics.sh --bootstrap-server $BROKERS \
>--add \
>    --name 'Disable_b69cc35a' \  # A descriptive name makes it easier to
> clean up old subscriptions.
>--match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \  #
> Match this specific client instance
>--block
>

--block will set the "interval" ConfigEntry to "0", which overrides and
disables all accumulated subscriptions for the matching client instance.
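A sketch of how that override could resolve, under the assumption (per the KIP discussion) that the smallest matching interval wins and 0 disables pushing entirely — illustrative Python, not broker code:

```python
def resolve_push_interval(matching_intervals_ms):
    # Resolve the effective push interval from all subscriptions that
    # match a client instance. Assumed semantics: an interval of 0 in
    # any matching subscription blocks the client outright; otherwise
    # the client pushes at the smallest configured interval.
    if not matching_intervals_ms:
        return None              # no matching subscription: nothing to push
    if 0 in matching_intervals_ms:
        return 0                 # blocked
    return min(matching_intervals_ms)
```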


Thanks,
Magnus



> On Thu, Mar 10, 2022 at 11:57 AM Jun Rao  wrote:
>
> > Hi, Kirk, Sarat,
> >
> > Thanks for the reply.
> >
> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > calculates a rate, but also exposes an accumulated value.
> >
> > 29. The Histogram class in org.apache.kafka.common.metrics.stats was
> never
> > used in the client metrics. The implementation of Histogram only
> provides a
> > fixed number of values in the domain and may not capture the quantiles
> very
> > accurately. So, we punted on using it.
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
> >  wrote:
> >
> >> Jun,
> >>
> >>   >>  28. For the broker metrics, could you spell out the full metric
> name
> >>   >>   including groups, tags, etc? We typically don't add the broker_id
> >> label for
> >>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
> >> have type
> >>   >>   Sum.
> >>
> >> Sure, I will update KIP-714 with the above information and will remove
> >> the broker_id label from the metrics.
> >>
> >> Regarding the type, is CumulativeSum the right type to use in place of
> >> Sum?
> >>
> >> Thanks
> >> Sarat
> >>
> >>
> >> On 3/8/22, 5:48 PM, "Jun Rao"  wrote:
> >>
> >> Hi, Magnus, Sarat and Xavier,
> >>
> >> Thanks for the reply. A few more comments below.
> >>
> >> 20. It seems that we are piggybacking the plugin on the
> >> existing MetricsReporter. So, this seems fine.
> >>
> >> 21. That could work. Are we requiring any additional jar dependency
> >> on the
> >> client? Or, are you suggesting that we check the runtime dependency
> >> to pick
> >> the compression codec?
> >>
> >> 28. For the broker metrics, could you spell out the full metric name
> >> including groups, tags, etc? We typically don't add the broker_id
> >> label for
> >> broker metrics. Also, brokers use Yammer metrics, which doesn't have
> >> type
> >> Sum.
> >>
> >> 29. There are several client metrics listed as histogram. However,
> >> the java
> >> client currently doesn't support histogram type.
> >>
> >> 30. Could you show an example of the metric payload in
> >> PushTelemetryRequest
> >> to help understand how we organize metrics at different levels (per
> >> instance, per topic, per partition, per broker, etc)?
> >>
> >> 31. Could you add a bit more detail on which client thread sends the
> >> PushTelemetryRequest?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill  >
> >> wrote:
> >>
> >> > Hi Jun,
> >> >
> >> > thanks for your insightful questions; see my answers below.
> >> > There have been a number of clarifications to the KIP.
> >> >
> >> >
> >> >
> >> > On Thu, 27 Jan 2022 at 20:08, Jun Rao wrote:
> >> >
> >> > > Hi, Magnus,
> >> > >
> >> > > Thanks for 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-21 Thread Jun Rao
Hi, Kirk, Sarat,

A few more comments.

40. GetTelemetrySubscriptionsResponseV0 : RequestedMetrics Array[string]
uses "Array[0] empty string" to represent all metrics subscribed. We had a
similar issue with the topics field in MetadataRequest and used the
following convention.
In version 1 and higher, an empty array indicates "request metadata for no
topics," and a null array is used to indicate "request metadata for all
topics."
Should we use the same convention in GetTelemetrySubscriptionsResponseV0?
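The MetadataRequest convention quoted above can be sketched as (illustrative Python; the function name is made up):

```python
def interpret_requested_metrics(requested):
    # MetadataRequest v1+ convention:
    #   null array  (None here) => subscribe to every metric
    #   empty array             => subscribe to no metrics
    #   non-empty array         => exactly the listed names/prefixes
    if requested is None:
        return "all"
    if len(requested) == 0:
        return "none"
    return "listed"
```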

41. We include CompressionType in PushTelemetryRequestV0, but not in
ClientTelemetryPayload. How would the implementer know the compression type
for the telemetry payload?

42. For blocking the metrics for certain clients in the following example,
could you describe the corresponding config value used through the
kafka-config command?
kafka-client-metrics.sh --bootstrap-server $BROKERS \
   --add \
   --name 'Disable_b69cc35a' \  # A descriptive name makes it easier to
clean up old subscriptions.
   --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \  #
Match this specific client instance
   --block

Thanks,

Jun


On Thu, Mar 10, 2022 at 11:57 AM Jun Rao  wrote:

> Hi, Kirk, Sarat,
>
> Thanks for the reply.
>
> 28. On the broker, we typically use Yammer metrics. Only for metrics that
> depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> calculates a rate, but also exposes an accumulated value.
>
> 29. The Histogram class in org.apache.kafka.common.metrics.stats was never
> used in the client metrics. The implementation of Histogram only provides a
> fixed number of values in the domain and may not capture the quantiles very
> accurately. So, we punted on using it.
>
> Thanks,
>
> Jun
>
>
>
> On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
>  wrote:
>
>> Jun,
>>
>>   >>  28. For the broker metrics, could you spell out the full metric name
>>   >>   including groups, tags, etc? We typically don't add the broker_id
>> label for
>>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
>> have type
>>   >>   Sum.
>>
>> Sure, I will update KIP-714 with the above information and will remove
>> the broker_id label from the metrics.
>>
>> Regarding the type, is CumulativeSum the right type to use in place of
>> Sum?
>>
>> Thanks
>> Sarat
>>
>>
>> On 3/8/22, 5:48 PM, "Jun Rao"  wrote:
>>
>> Hi, Magnus, Sarat and Xavier,
>>
>> Thanks for the reply. A few more comments below.
>>
>> 20. It seems that we are piggybacking the plugin on the
>> existing MetricsReporter. So, this seems fine.
>>
>> 21. That could work. Are we requiring any additional jar dependency
>> on the
>> client? Or, are you suggesting that we check the runtime dependency
>> to pick
>> the compression codec?
>>
>> 28. For the broker metrics, could you spell out the full metric name
>> including groups, tags, etc? We typically don't add the broker_id
>> label for
>> broker metrics. Also, brokers use Yammer metrics, which doesn't have
>> type
>> Sum.
>>
>> 29. There are several client metrics listed as histogram. However,
>> the java
>> client currently doesn't support histogram type.
>>
>> 30. Could you show an example of the metric payload in
>> PushTelemetryRequest
>> to help understand how we organize metrics at different levels (per
>> instance, per topic, per partition, per broker, etc)?
>>
>> 31. Could you add a bit more detail on which client thread sends the
>> PushTelemetryRequest?
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill 
>> wrote:
>>
>> > Hi Jun,
>> >
>> > thanks for your insightful questions; see my answers below.
>> > There have been a number of clarifications to the KIP.
>> >
>> >
>> >
>> > On Thu, 27 Jan 2022 at 20:08, Jun Rao wrote:
>> >
>> > > Hi, Magnus,
>> > >
>> > > Thanks for updating the KIP. The overall approach makes sense to
>> me. A
>> > few
>> > > more detailed comments below.
>> > >
>> > > 20. ClientTelemetry: Should it be extending configurable and
>> closable?
>> > >
>> >
>> > I'll pass this question to Sarat and/or Xavier.
>> >
>> >
>> >
>> > > 21. Compression of the metrics on the client: what's the default?
>> > >
>> >
>> > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
>> > But ultimately it is up to what the client supports.
>> >
>> >
>> > 23. A client instance is considered a metric resource and the
>> > > resource-level (thus client instance level) labels could include:
>> > > client_software_name=confluent-kafka-python
>> > > client_software_version=v2.1.3
>> > > client_instance_id=B64CD139-3975-440A-91D4
>> > > transactional_id=someTxnApp
>> > > Are those labels added in PushTelemetryRequest? 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-10 Thread Jun Rao
Hi, Kirk, Sarat,

Thanks for the reply.

28. On the broker, we typically use Yammer metrics. Only for metrics that
depend on Kafka metric features (e.g., quota), we use the Kafka metric.
Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
calculates a rate, but also exposes an accumulated value.

29. The Histogram class in org.apache.kafka.common.metrics.stats was never
used in the client metrics. The implementation of Histogram only provides a
fixed number of values in the domain and may not capture the quantiles very
accurately. So, we punted on using it.
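Regarding comment 28, the meter behaviour — a rate derived from an accumulated count — can be modelled roughly as follows (a simplified mean-rate sketch; real Yammer meters additionally keep exponentially weighted moving averages):

```python
class MeterSketch:
    # Simplified model of a Yammer-style meter: it exposes both an
    # accumulated count and a rate derived from that count.
    def __init__(self, start_seconds):
        self.count = 0
        self.start = start_seconds

    def mark(self, n=1):
        # Record n events; the accumulated value keeps growing.
        self.count += n

    def mean_rate(self, now_seconds):
        # Events per second since the meter was created.
        elapsed = now_seconds - self.start
        return self.count / elapsed if elapsed > 0 else 0.0
```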

Thanks,

Jun



On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
 wrote:

> Jun,
>
>   >>  28. For the broker metrics, could you spell out the full metric name
>   >>   including groups, tags, etc? We typically don't add the broker_id
> label for
>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
> have type
>   >>   Sum.
>
> Sure, I will update KIP-714 with the above information and will remove
> the broker_id label from the metrics.
>
> Regarding the type, is CumulativeSum the right type to use in place of
> Sum?
>
> Thanks
> Sarat
>
>
> On 3/8/22, 5:48 PM, "Jun Rao"  wrote:
>
> Hi, Magnus, Sarat and Xavier,
>
> Thanks for the reply. A few more comments below.
>
> 20. It seems that we are piggybacking the plugin on the
> existing MetricsReporter. So, this seems fine.
>
> 21. That could work. Are we requiring any additional jar dependency on
> the
> client? Or, are you suggesting that we check the runtime dependency to
> pick
> the compression codec?
>
> 28. For the broker metrics, could you spell out the full metric name
> including groups, tags, etc? We typically don't add the broker_id
> label for
> broker metrics. Also, brokers use Yammer metrics, which doesn't have
> type
> Sum.
>
> 29. There are several client metrics listed as histogram. However, the
> java
> client currently doesn't support histogram type.
>
> 30. Could you show an example of the metric payload in
> PushTelemetryRequest
> to help understand how we organize metrics at different levels (per
> instance, per topic, per partition, per broker, etc)?
>
> 31. Could you add a bit more detail on which client thread sends the
> PushTelemetryRequest?
>
> Thanks,
>
> Jun
>
> On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill 
> wrote:
>
> > Hi Jun,
> >
> > thanks for your insightful questions; see my answers below.
> > There have been a number of clarifications to the KIP.
> >
> >
> >
> > On Thu, 27 Jan 2022 at 20:08, Jun Rao wrote:
> >
> > > Hi, Magnus,
> > >
> > > Thanks for updating the KIP. The overall approach makes sense to
> me. A
> > few
> > > more detailed comments below.
> > >
> > > 20. ClientTelemetry: Should it be extending configurable and
> closable?
> > >
> >
> > I'll pass this question to Sarat and/or Xavier.
> >
> >
> >
> > > 21. Compression of the metrics on the client: what's the default?
> > >
> >
> > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> > But ultimately it is up to what the client supports.
> >
> >
> > 23. A client instance is considered a metric resource and the
> > > resource-level (thus client instance level) labels could include:
> > > client_software_name=confluent-kafka-python
> > > client_software_version=v2.1.3
> > > client_instance_id=B64CD139-3975-440A-91D4
> > > transactional_id=someTxnApp
> > > Are those labels added in PushTelemetryRequest? If so, are they per
> > metric
> > > or per request?
> > >
> >
> >
> > client_software* and client_instance_id are not added by the client, but
> > available to the broker-side metrics plugin for adding as it sees fit;
> > I've removed them from the KIP.
> >
> > As for transactional_id, group_id, etc., which I believe will be useful in
> > troubleshooting, they are included only once (per push) as resource-level
> > attributes (the client instance is a singular resource).
> >
> >
> > >
> > > 24.  "the broker will only send
> > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > > 24.1 If it's always true, does it need to be part of the protocol?
> > >
> >
> > We're anticipating that it will take a lot longer to upgrade the
> majority
> > of clients than the
> > broker/plugin side, which is why we want the client to support both
> > temporalities out-of-the-box
> > so that cumulative reporting can be turned on seamlessly in the
> future.
> >
> >
> >
> > > 24.2 Does delta only apply to Counter type?
> > >
> >
> >
> > And Histograms. More details in Xavier's OTLP link.
> >
> >
> >
> > > 24.3 In the delta representation, the first request 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-10 Thread Sarat Kakarla
Jun,
 
  >>  28. For the broker metrics, could you spell out the full metric name
  >>   including groups, tags, etc? We typically don't add the broker_id label 
for
  >>   broker metrics. Also, brokers use Yammer metrics, which doesn't have type
  >>   Sum.

Sure, I will update KIP-714 with the above information and will remove the
broker_id label from the metrics.

Regarding the type, is CumulativeSum the right type to use in place of Sum?

Thanks
Sarat


On 3/8/22, 5:48 PM, "Jun Rao"  wrote:

Hi, Magnus, Sarat and Xavier,

Thanks for the reply. A few more comments below.

20. It seems that we are piggybacking the plugin on the
existing MetricsReporter. So, this seems fine.

21. That could work. Are we requiring any additional jar dependency on the
client? Or, are you suggesting that we check the runtime dependency to pick
the compression codec?

28. For the broker metrics, could you spell out the full metric name
including groups, tags, etc? We typically don't add the broker_id label for
broker metrics. Also, brokers use Yammer metrics, which doesn't have type
Sum.

29. There are several client metrics listed as histogram. However, the java
client currently doesn't support histogram type.

30. Could you show an example of the metric payload in PushTelemetryRequest
to help understand how we organize metrics at different levels (per
instance, per topic, per partition, per broker, etc)?

31. Could you add a bit more detail on which client thread sends the
PushTelemetryRequest?

Thanks,

Jun

On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill  wrote:

> Hi Jun,
>
> thanks for your insightful questions; see my answers below.
> There have been a number of clarifications to the KIP.
>
>
>
> On Thu, 27 Jan 2022 at 20:08, Jun Rao wrote:
>
> > Hi, Magnus,
> >
> > Thanks for updating the KIP. The overall approach makes sense to me. A
> few
> > more detailed comments below.
> >
> > 20. ClientTelemetry: Should it be extending configurable and closable?
> >
>
> I'll pass this question to Sarat and/or Xavier.
>
>
>
> > 21. Compression of the metrics on the client: what's the default?
> >
>
> How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> But ultimately it is up to what the client supports.
>
>
> 23. A client instance is considered a metric resource and the
> > resource-level (thus client instance level) labels could include:
> > client_software_name=confluent-kafka-python
> > client_software_version=v2.1.3
> > client_instance_id=B64CD139-3975-440A-91D4
> > transactional_id=someTxnApp
> > Are those labels added in PushTelemetryRequest? If so, are they per
> metric
> > or per request?
> >
>
>
> client_software* and client_instance_id are not added by the client, but
> available to the broker-side metrics plugin for adding as it sees fit;
> I've removed them from the KIP.
>
> As for transactional_id, group_id, etc., which I believe will be useful in
> troubleshooting, they are included only once (per push) as resource-level
> attributes (the client instance is a singular resource).
>
>
> >
> > 24.  "the broker will only send
> > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > 24.1 If it's always true, does it need to be part of the protocol?
> >
>
> We're anticipating that it will take a lot longer to upgrade the majority
> of clients than the
> broker/plugin side, which is why we want the client to support both
> temporalities out-of-the-box
> so that cumulative reporting can be turned on seamlessly in the future.
>
>
>
> > 24.2 Does delta only apply to Counter type?
> >
>
>
> And Histograms. More details in Xavier's OTLP link.
>
>
>
> > 24.3 In the delta representation, the first request needs to send the
> full
> > value, how does the broker plugin know whether a value is full or delta?
> >
>
> The client may (should) send the start time for each metric sample,
> indicating when
> the metric began to be collected.
> We've discussed whether this should be the client instance start time or
> the time when a matching
> metric subscription for that metric is received.
> For completeness we recommend using the former, the client instance start
> time.
>
>
>
> > 25. quota:
> > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > quota, it would be useful to document the impact, i.e. client metric
> > throttling causes the data from the same client to be delayed.
> > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
> the
> > producer?
> >
>
>
> Yes, it 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-09 Thread Kirk True
Hi Jun,

On Tue, Mar 8, 2022, at 5:47 PM, Jun Rao wrote:
> Hi, Magnus, Sarat and Xavier,
> 
> Thanks for the reply. A few more comments below.
> 
> 20. It seems that we are piggybacking the plugin on the
> existing MetricsReporter. So, this seems fine.
> 
> 21. That could work. Are we requiring any additional jar dependency on the
> client? Or, are you suggesting that we check the runtime dependency to pick
> the compression codec?

The Java client doesn't require any additional libraries for compression, no.

> 28. For the broker metrics, could you spell out the full metric name
> including groups, tags, etc? We typically don't add the broker_id label for
> broker metrics. Also, brokers use Yammer metrics, which doesn't have type
> Sum.
> 
> 29. There are several client metrics listed as histogram. However, the java
> client currently doesn't support histogram type.

There does appear to be some code related to histograms in the 
org.apache.kafka.common.metrics.stats package. But we're still looking into the 
implementation to see if there's anything needed for KIP-714.

> 30. Could you show an example of the metric payload in PushTelemetryRequest
> to help understand how we organize metrics at different levels (per
> instance, per topic, per partition, per broker, etc)?
> 
> 31. Could you add a bit more detail on which client thread sends the
> PushTelemetryRequest?

Yes, I will add that to the KIP.

Thanks,
Kirk

> Thanks,
> 
> Jun
> 
> On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill  wrote:
> 
> > Hi Jun,
> >
> > thanks for your insightful questions; see my answers below.
> > There have been a number of clarifications to the KIP.
> >
> >
> >
> > On Thu, 27 Jan 2022 at 20:08, Jun Rao wrote:
> >
> > > Hi, Magnus,
> > >
> > > Thanks for updating the KIP. The overall approach makes sense to me. A
> > few
> > > more detailed comments below.
> > >
> > > 20. ClientTelemetry: Should it be extending configurable and closable?
> > >
> >
> > I'll pass this question to Sarat and/or Xavier.
> >
> >
> >
> > > 21. Compression of the metrics on the client: what's the default?
> > >
> >
> > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> > But ultimately it is up to what the client supports.
> >
> >
> > 23. A client instance is considered a metric resource and the
> > > resource-level (thus client instance level) labels could include:
> > > client_software_name=confluent-kafka-python
> > > client_software_version=v2.1.3
> > > client_instance_id=B64CD139-3975-440A-91D4
> > > transactional_id=someTxnApp
> > > Are those labels added in PushTelemetryRequest? If so, are they per
> > metric
> > > or per request?
> > >
> >
> >
> > client_software* and client_instance_id are not added by the client, but
> > available to the broker-side metrics plugin for adding as it sees fit;
> > I've removed them from the KIP.
> >
> > As for transactional_id, group_id, etc., which I believe will be useful in
> > troubleshooting, they are included only once (per push) as resource-level
> > attributes (the client instance is a singular resource).
> >
> >
> > >
> > > 24.  "the broker will only send
> > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > > 24.1 If it's always true, does it need to be part of the protocol?
> > >
> >
> > We're anticipating that it will take a lot longer to upgrade the majority
> > of clients than the
> > broker/plugin side, which is why we want the client to support both
> > temporalities out-of-the-box
> > so that cumulative reporting can be turned on seamlessly in the future.
> >
> >
> >
> > > 24.2 Does delta only apply to Counter type?
> > >
> >
> >
> > And Histograms. More details in Xavier's OTLP link.
> >
> >
> >
> > > 24.3 In the delta representation, the first request needs to send the
> > full
> > > value, how does the broker plugin know whether a value is full or delta?
> > >
> >
> > The client may (should) send the start time for each metric sample,
> > indicating when
> > the metric began to be collected.
> > We've discussed whether this should be the client instance start time or
> > the time when a matching
> > metric subscription for that metric is received.
> > For completeness we recommend using the former, the client instance start
> > time.
> >
> >
> >
> > > 25. quota:
> > > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > > quota, it would be useful to document the impact, i.e. client metric
> > > throttling causes the data from the same client to be delayed.
> > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
> > the
> > > producer?
> > >
> >
> >
> > Yes, it should be, so as to protect the cluster from rogue clients.
> > But, in practice the size of metrics will be quite low (e.g., 1-10kb per
> > 60s interval), so I don't think this will pose a problem.
> > The KIP has been updated with more details on quota/throttling behaviour,
> > see the
> > "Throttling and rate-limiting" section.
> >
> >
> > 25.3 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-08 Thread Jun Rao
Hi, Magnus, Sarat and Xavier,

Thanks for the reply. A few more comments below.

20. It seems that we are piggybacking the plugin on the
existing MetricsReporter. So, this seems fine.

21. That could work. Are we requiring any additional jar dependency on the
client? Or, are you suggesting that we check the runtime dependency to pick
the compression codec?

28. For the broker metrics, could you spell out the full metric name
including groups, tags, etc? We typically don't add the broker_id label for
broker metrics. Also, brokers use Yammer metrics, which doesn't have type
Sum.

29. There are several client metrics listed as histogram. However, the java
client currently doesn't support histogram type.

30. Could you show an example of the metric payload in PushTelemetryRequest
to help understand how we organize metrics at different levels (per
instance, per topic, per partition, per broker, etc)?

31. Could you add a bit more detail on which client thread sends the
PushTelemetryRequest?

Thanks,

Jun

On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill  wrote:

> Hi Jun,
>
> thanks for your well-informed questions, see my answers below.
> There's been a number of clarifications to the KIP.
>
>
>
> Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao :
>
> > Hi, Magnus,
> >
> > Thanks for updating the KIP. The overall approach makes sense to me. A
> few
> > more detailed comments below.
> >
> > 20. ClientTelemetry: Should it be extending configurable and closable?
> >
>
> I'll pass this question to Sarat and/or Xavier.
>
>
>
> > 21. Compression of the metrics on the client: what's the default?
> >
>
> How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> But ultimately it is up to what the client supports.
>
>
> 23. A client instance is considered a metric resource and the
> > resource-level (thus client instance level) labels could include:
> > client_software_name=confluent-kafka-python
> > client_software_version=v2.1.3
> > client_instance_id=B64CD139-3975-440A-91D4
> > transactional_id=someTxnApp
> > Are those labels added in PushTelemetryRequest? If so, are they per
> metric
> > or per request?
> >
>
>
> client_software* and client_instance_id are not added by the client, but
> available to
> the broker-side metrics plugin for adding as it sees fit, so they have been
> removed from the KIP.
>
> As for transactional_id, group_id, etc., which I believe will be useful in
> troubleshooting,
> they are included only once (per push) as resource-level attributes (the client
> instance is a singular resource).
>
>
> >
> > 24.  "the broker will only send
> > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > 24.1 If it's always true, does it need to be part of the protocol?
> >
>
> We're anticipating that it will take a lot longer to upgrade the majority
> of clients than the
> broker/plugin side, which is why we want the client to support both
> temporalities out-of-the-box
> so that cumulative reporting can be turned on seamlessly in the future.
>
>
>
> > 24.2 Does delta only apply to Counter type?
> >
>
>
> And Histograms. More details in Xavier's OTLP link.
>
>
>
> > 24.3 In the delta representation, the first request needs to send the
> full
> > value, how does the broker plugin know whether a value is full or delta?
> >
>
> The client may (should) send the start time for each metric sample,
> indicating when
> the metric began to be collected.
> We've discussed whether this should be the client instance start time or
> the time when a matching
> metric subscription for that metric is received.
> For completeness we recommend using the former, the client instance start
> time.
>
>
>
> > 25. quota:
> > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > quota, it would be useful to document the impact, i.e. client metric
> > throttling causes the data from the same client to be delayed.
> > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
> the
> > producer?
> >
>
>
> Yes, it should be, so as to protect the cluster from rogue clients.
> But, in practice the size of metrics will be quite low (e.g., 1-10kb per
> 60s interval), so I don't think this will pose a problem.
> The KIP has been updated with more details on quota/throttling behaviour,
> see the
> "Throttling and rate-limiting" section.
>
>
> 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> > the request/bandwidth quota is exceeded since those requests are not
> > rejected. We only set this error when the request is rejected (e.g.,
> topic
> > creation). It would be useful to clarify when this error is used.
> >
>
> Right, I was trying to reuse an existing error-code. We can introduce
> a new one for the case where a client pushes metrics at a higher frequency
> than the configured push interval (e.g., out-of-profile sends).
> This causes the broker to drop those metrics and send this error code back
> to the client. There will be no connection throttling / channel-muting in
> this case (unless the standard quotas are exceeded).

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-08 Thread Tom Bentley
Hi Ashenafi,

You'll need to unsubscribe from the dev mailing list by sending an email to
dev-unsubscr...@kafka.apache.org. No one else can do this for you.

Kind regards,

Tom

On Tue, 8 Mar 2022 at 04:40, Ashenafi Marcos  wrote:

> Hi,
> Can you please take out my email ID so that I will not be able to receive
> any mail from you.
> Thank you
>
> On Tue, Oct 19, 2021 at 1:30 PM Mickael Maison 
> wrote:
>
> > Hi Magnus,
> >
> > Thanks for the proposal.
> >
> > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > does a client retrieve this value?
> >
> > 2. In the client API section, you mention a new method
> > "clientInstanceId()". Can you clarify which interfaces are affected?
> > Is it only Consumer and Producer?
> >
> > 3. I'm a bit concerned this is enabled by default. Even if the data
> > collected is supposed to be not sensitive, I think this can be
> > problematic in some environments. Also users don't seem to have the
> > choice to only expose some metrics. Knowing how much data transit
> > through some applications can be considered critical.
> >
> > 4. As a user, how do you know if your application is actively sending
> > metrics? Are there new metrics exposing what's going on, like how much
> > data is being sent?
> >
> > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > you have an idea how much throughput this would use?
> >
> > Thanks
> >
> > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill 
> > wrote:
> > >
> > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley :
> > >
> > > > Hi Magnus,
> > > >
> > > > I reviewed the KIP since you called the vote (sorry for not reviewing
> > when
> > > > you announced your intention to call the vote). I have a few
> questions
> > on
> > > > some of the details.
> > > >
> > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> know
> > > > whether the payload is exposed through this method as compressed or
> > not.
> > > > Later on you say "Decompression of the payloads will be handled by
> the
> > > > broker metrics plugin, the broker should expose a suitable
> > decompression
> > > > API to the metrics plugin for this purpose.", which suggests it's the
> > > > compressed data in the buffer, but then we don't know which codec was
> > used,
> > > > nor the API via which the plugin should decompress it if required for
> > > > forwarding to the ultimate metrics store. Should the
> > ClientTelemetryPayload
> > > > expose a method to get the compression and a decompressor?
> > > >
> > >
> > > Good point, updated.
> > >
> > >
> > >
> > > > 2. The client-side API is expressed as StringOrError
> > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> > you're
> > > > thinking about the librdkafka implementation, but it would be good to
> > show
> > > > the API as it would appear on the Apache Kafka clients.
> > > >
> > >
> > > This was meant as pseudo-code, but I changed it to Java.
> > >
> > >
> > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > client to
> > > > send metrics to any broker it is connected to." To be clear, this
> means
> > > > that the client can choose any of the connected brokers and push to
> > just
> > > > one of them? What should a supporting client do if it gets an error
> > when
> > > > pushing metrics to a broker, retry sending to the same broker or try
> > > > pushing to another broker, or drop the metrics? Should supporting
> > clients
> > > > send successive requests to a single broker, or round robin, or is
> > that up
> > > > to the client author? I'm guessing the behaviour should be sticky to
> > > > support the rate limiting features, but I think it would be good for
> > client
> > > > authors if this section were explicit on the recommended behaviour.
> > > >
> > >
> > > You are right, I've updated the KIP to make this clearer.
> > >
> > >
> > > > 4. "Mapping the client instance id to an actual application instance
> > > > running on a (virtual) machine can be done by inspecting the metrics
> > > > resource labels, such as the client source address and source port,
> or
> > > > security principal, all of which are added by the receiving broker.
> > This
> > > > will allow the operator together with the user to identify the actual
> > > > application instance." Is this really always true? The source IP and
> > port
> > > > might be a loadbalancer/proxy in some setups. The principal, as
> already
> > > > mentioned in the KIP, might be shared between multiple applications.
> > So at
> > > > worst the organization running the clients might have to consult the
> > logs
> > > > of a set of client applications, right?
> > > >
> > >
> > > Yes, that's correct. There's no guaranteed mapping from
> > client_instance_id
> > > to
> > > an actual instance, that's why the KIP recommends client
> implementations
> > to
> > > log the client instance id
> > > upon retrieval, and also provide an API for the application to retrieve
> > > the instance id programmatically if it has a better way of exposing it.

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-07 Thread Ashenafi Marcos
Hi,
Can you please take out my email ID so that I will not be able to receive
any mail from you.
Thank you

On Tue, Oct 19, 2021 at 1:30 PM Mickael Maison 
wrote:

> Hi Magnus,
>
> Thanks for the proposal.
>
> 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> does a client retrieve this value?
>
> 2. In the client API section, you mention a new method
> "clientInstanceId()". Can you clarify which interfaces are affected?
> Is it only Consumer and Producer?
>
> 3. I'm a bit concerned this is enabled by default. Even if the data
> collected is supposed to be not sensitive, I think this can be
> problematic in some environments. Also users don't seem to have the
> choice to only expose some metrics. Knowing how much data transit
> through some applications can be considered critical.
>
> 4. As a user, how do you know if your application is actively sending
> metrics? Are there new metrics exposing what's going on, like how much
> data is being sent?
>
> 5. If all metrics are enabled on a regular Consumer or Producer, do
> you have an idea how much throughput this would use?
>
> Thanks
>
> On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill 
> wrote:
> >
> > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley :
> >
> > > Hi Magnus,
> > >
> > > I reviewed the KIP since you called the vote (sorry for not reviewing
> when
> > > you announced your intention to call the vote). I have a few questions
> on
> > > some of the details.
> > >
> > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> > > whether the payload is exposed through this method as compressed or
> not.
> > > Later on you say "Decompression of the payloads will be handled by the
> > > broker metrics plugin, the broker should expose a suitable
> decompression
> > > API to the metrics plugin for this purpose.", which suggests it's the
> > > compressed data in the buffer, but then we don't know which codec was
> used,
> > > nor the API via which the plugin should decompress it if required for
> > > forwarding to the ultimate metrics store. Should the
> ClientTelemetryPayload
> > > expose a method to get the compression and a decompressor?
> > >
> >
> > Good point, updated.
> >
> >
> >
> > > 2. The client-side API is expressed as StringOrError
> > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> you're
> > > thinking about the librdkafka implementation, but it would be good to
> show
> > > the API as it would appear on the Apache Kafka clients.
> > >
> >
> > This was meant as pseudo-code, but I changed it to Java.
> >
> >
> > > 3. "PushTelemetryRequest|Response - protocol request used by the
> client to
> > > send metrics to any broker it is connected to." To be clear, this means
> > > that the client can choose any of the connected brokers and push to
> just
> > > one of them? What should a supporting client do if it gets an error
> when
> > > pushing metrics to a broker, retry sending to the same broker or try
> > > pushing to another broker, or drop the metrics? Should supporting
> clients
> > > send successive requests to a single broker, or round robin, or is
> that up
> > > to the client author? I'm guessing the behaviour should be sticky to
> > > support the rate limiting features, but I think it would be good for
> client
> > > authors if this section were explicit on the recommended behaviour.
> > >
> >
> > You are right, I've updated the KIP to make this clearer.
> >
> >
> > > 4. "Mapping the client instance id to an actual application instance
> > > running on a (virtual) machine can be done by inspecting the metrics
> > > resource labels, such as the client source address and source port, or
> > > security principal, all of which are added by the receiving broker.
> This
> > > will allow the operator together with the user to identify the actual
> > > application instance." Is this really always true? The source IP and
> port
> > > might be a loadbalancer/proxy in some setups. The principal, as already
> > > mentioned in the KIP, might be shared between multiple applications.
> So at
> > > worst the organization running the clients might have to consult the
> logs
> > > of a set of client applications, right?
> > >
> >
> > Yes, that's correct. There's no guaranteed mapping from
> client_instance_id
> > to
> > an actual instance, that's why the KIP recommends client implementations
> to
> > log the client instance id
> > upon retrieval, and also provide an API for the application to retrieve
> the
> > instance id programmatically
> > if it has a better way of exposing it.
> >
> >
> > 5. "Tests indicate that a compression ratio up to 10x is possible for the
> > > standard metrics." Client authors might appreciate your mentioning
> which
> > > compression codec got these results.
> > >
> >
> > Good point. Updated.
> >
> >
> > > 6. "Should the client send a push request prior to expiry of the
> previously
> > > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-03-07 Thread Magnus Edenhill
Hi Jun,

thanks for your well-informed questions, see my answers below.
There's been a number of clarifications to the KIP.



Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao :

> Hi, Magnus,
>
> Thanks for updating the KIP. The overall approach makes sense to me. A few
> more detailed comments below.
>
> 20. ClientTelemetry: Should it be extending configurable and closable?
>

I'll pass this question to Sarat and/or Xavier.



> 21. Compression of the metrics on the client: what's the default?
>

How about we specify a prioritized list: zstd, lz4, snappy, gzip?
But ultimately it is up to what the client supports.
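In code, negotiation along these lines could pick the first codec in the preference list that both sides support; a minimal sketch (the codec names and function shape are illustrative, not the KIP's wire format):

```python
# Preference order proposed above: strongest compression first.
CODEC_PREFERENCE = ["zstd", "lz4", "snappy", "gzip"]

def select_codec(client_codecs, broker_codecs):
    """Return the first codec in the preference list supported by both
    the client build and the broker, or None to send uncompressed."""
    for codec in CODEC_PREFERENCE:
        if codec in client_codecs and codec in broker_codecs:
            return codec
    return None  # fall back to uncompressed metrics

# A client built without zstd still negotiates the next-best codec:
assert select_codec({"lz4", "gzip"}, {"zstd", "lz4", "snappy", "gzip"}) == "lz4"
```

The same function degrades gracefully to uncompressed metrics when there is no overlap at all.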


23. A client instance is considered a metric resource and the
> resource-level (thus client instance level) labels could include:
> client_software_name=confluent-kafka-python
> client_software_version=v2.1.3
> client_instance_id=B64CD139-3975-440A-91D4
> transactional_id=someTxnApp
> Are those labels added in PushTelemetryRequest? If so, are they per metric
> or per request?
>


client_software* and client_instance_id are not added by the client, but
available to
the broker-side metrics plugin for adding as it sees fit, so they have been
removed from the KIP.

As for transactional_id, group_id, etc., which I believe will be useful in
troubleshooting,
they are included only once (per push) as resource-level attributes (the client
instance is a singular resource).


>
> 24.  "the broker will only send
> GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> 24.1 If it's always true, does it need to be part of the protocol?
>

We're anticipating that it will take a lot longer to upgrade the majority
of clients than the
broker/plugin side, which is why we want the client to support both
temporalities out-of-the-box
so that cumulative reporting can be turned on seamlessly in the future.



> 24.2 Does delta only apply to Counter type?
>


And Histograms. More details in Xavier's OTLP link.



> 24.3 In the delta representation, the first request needs to send the full
> value, how does the broker plugin know whether a value is full or delta?
>

The client may (should) send the start time for each metric sample,
indicating when
the metric began to be collected.
We've discussed whether this should be the client instance start time or
the time when a matching
metric subscription for that metric is received.
For completeness we recommend using the former, the client instance start
time.
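A hypothetical client-side counter illustrating the idea: each push carries (start time, delta), and because the start time is the client instance start time, the broker plugin can recognize the first push after a (re)start as a full value (the class and field names are illustrative, not the KIP's API):

```python
class DeltaCounter:
    """Counter reported with delta temporality. Each push carries
    (start_time, delta); an unchanged start time tells the broker
    plugin the deltas all accumulate from the same client instance,
    so the first push after a (re)start is implicitly a full value."""

    def __init__(self, start_time):
        self.start_time = start_time  # client instance start time
        self.total = 0
        self.reported = 0

    def add(self, n):
        self.total += n

    def collect(self):
        """Called once per push interval; returns the sample to push."""
        delta = self.total - self.reported
        self.reported = self.total
        return (self.start_time, delta)

c = DeltaCounter(start_time=1_700_000_000)
c.add(5); c.add(3)
assert c.collect() == (1_700_000_000, 8)  # first push: full value so far
c.add(2)
assert c.collect() == (1_700_000_000, 2)  # later pushes: increment only
```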



> 25. quota:
> 25.1 Since we are fitting PushTelemetryRequest into the existing request
> quota, it would be useful to document the impact, i.e. client metric
> throttling causes the data from the same client to be delayed.
> 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
> producer?
>


Yes, it should be, so as to protect the cluster from rogue clients.
But, in practice the size of metrics will be quite low (e.g., 1-10kb per
60s interval), so I don't think this will pose a problem.
The KIP has been updated with more details on quota/throttling behaviour,
see the
"Throttling and rate-limiting" section.


25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> the request/bandwidth quota is exceeded since those requests are not
> rejected. We only set this error when the request is rejected (e.g., topic
> creation). It would be useful to clarify when this error is used.
>

Right, I was trying to reuse an existing error-code. We can introduce
a new one for the case where a client pushes metrics at a higher frequency
than the configured push interval (e.g., out-of-profile sends).
This causes the broker to drop those metrics and send this error code back
to the client. There will be no connection throttling / channel-muting in
this
case (unless the standard quotas are exceeded).
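Broker-side, the out-of-profile check could be as simple as comparing the push arrival time against the subscribed interval; a sketch (the 50% tolerance is an assumption for illustration, not a value from the KIP):

```python
def accept_push(last_push_s, now_s, push_interval_s, tolerance=0.5):
    """Accept a PushTelemetryRequest only if enough of the configured
    interval has elapsed since the previous accepted push; otherwise
    the metrics are dropped and the new error code is returned, with
    no connection throttling or channel muting."""
    return (now_s - last_push_s) >= push_interval_s * tolerance

assert accept_push(0, 60, 60)        # on-schedule push: accepted
assert not accept_push(0, 10, 60)    # out-of-profile push: dropped
```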


> 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> bad client?
>

There's now a --block option to kafka-client-metrics.sh which overrides all
subscriptions
for the matched client(s). This allows silencing metrics for one or more
clients without having
to remove existing subscriptions. From the client's perspective it will
look like it no longer has
any subscriptions.

# Block metrics collection for a specific client instance.
# A descriptive subscription name makes it easier to clean up old
# subscriptions; --match selects this specific client instance.
$ kafka-client-metrics.sh --bootstrap-server $BROKERS \
   --add \
   --name 'Disable_b69cc35a' \
   --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
   --block




> 28. New broker side metrics: Could we spell out the details of the metrics
> (e.g., group, tags, etc)?
>

KIP has been updated accordingly (thanks Sarat).



>
> 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> histogram.
>

I believe a population/distribution should preferably be represented as a
histogram, space permitting,
and only secondarily as a Gauge average.
While we might not want to maintain a bunch of histograms for each

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-02-02 Thread Xavier Léauté
 24.2 Does delta only apply to Counter type?

> 24.3 In the delta representation, the first request needs to send the full
> value, how does the broker plugin know whether a value is full or delta?
>

The temporality semantics are defined by the OpenTelemetry data model.
Deferring to OpenTelemetry avoids having to redefine all those semantics in
the Kafka protocol.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md

Hopefully that clarifies things,
Xavier
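For reference, the relevant OTLP fields look roughly like this (field names follow the OpenTelemetry metrics data model; the dict shape is an illustrative stand-in for the protobuf encoding):

```python
def sum_metric(name, value, start_ns, time_ns, delta=True):
    """Skeleton of an OTLP Sum: aggregation_temporality plus the
    start_time_unix_nano / time_unix_nano pair is what lets the
    receiver tell delta points from cumulative ones."""
    return {
        "name": name,
        "sum": {
            "aggregation_temporality": "DELTA" if delta else "CUMULATIVE",
            "is_monotonic": True,
            "data_points": [{
                "start_time_unix_nano": start_ns,
                "time_unix_nano": time_ns,
                "as_int": value,
            }],
        },
    }

m = sum_metric("client.request.success", 42, start_ns=1, time_ns=2)
assert m["sum"]["aggregation_temporality"] == "DELTA"
```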


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-01-31 Thread Sarat Kakarla
Jun

Following are the answers for some the questions raised by you.

>> 26. client-metrics entity:
>> 26.1 It seems that we could add multiple entities that match to the same 
>> client. Which one takes precedent?

All the matching client metrics would be compiled into a single list and sent
to the client.
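One plausible way to compile those matches (how overlapping intervals combine is an assumption here for illustration; the answer above only states the matches are merged into a single list):

```python
def compile_subscriptions(matching):
    """Merge every subscription matching a client into the single
    subscription returned to it: union the metric prefixes and, by
    assumption, let the shortest push interval win."""
    metrics = sorted({m for sub in matching for m in sub["metrics"]})
    interval_ms = min(sub["interval_ms"] for sub in matching)
    return {"metrics": metrics, "interval_ms": interval_ms}

merged = compile_subscriptions([
    {"metrics": ["client.producer.partition."], "interval_ms": 60_000},
    {"metrics": ["client.request.", "client.producer.partition."],
     "interval_ms": 30_000},
])
assert merged["interval_ms"] == 30_000
assert merged["metrics"] == ["client.producer.partition.", "client.request."]
```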

>> 26.2 How do we persist the new client metrics entities? Do we need to add 
>> new ZK paths and new records in KRaft?

The idea is to add a new ConfigResourceType:CLIENT_METRICS and follow the same 
code paths as the other config resources as described in ConfigResource.Type, 
which means new ZK paths and new KRaft records would be added.

Thanks
Sarat




On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill  wrote:

> Hi all,
>
> I've updated the KIP with responses to the latest comments: Java client
> dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
> producer, etc), etc.
>
> I will revive the vote thread.
>
> Thanks,
> Magnus
>
>
> Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan :
>
> > I think we should be very careful about introducing new runtime
> > dependencies into the clients. Historically this has been rare and
> > essentially necessary (e.g. compression libs).
> >
> > Ryanne
> >
> > On Mon, Dec 13, 2021, 1:06 PM Kirk True  wrote:
> >
> > > Hi Jun,
> > >
> > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > on OpenTelemetry library? How good is the compatibility story
> > > > of OpenTelemetry? This is important since an application could have
> > other
> > > > OpenTelemetry dependencies than the Kafka client.
> > >
> > > The current design is that the OpenTelemetry JARs would ship with the
> > > client. Perhaps we can design the client such that the JARs aren't 
even
> > > loaded if the user has opted out. The user could even exclude the JARs
> > from
> > > their dependencies if they so wished.
> > >
> > > I can't speak to the compatibility of the libraries. Is it possible
> that
> > > we include a shaded version?
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > 14. The proposal listed idempotence=true. This is more of a
> > configuration
> > > > than a metric. Are we including that as a metric? What other
> > > configurations
> > > > are we including? Should we separate the configurations from the
> > metrics?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill 
> > > wrote:
> > > >
> > > > > Hey Bob,
> > > > >
> > > > > That's a good point.
> > > > >
> > > > > Request type labels were considered but since they're already
> tracked
> > > by
> > > > > broker-side metrics
> > > > > they were left out as to avoid metric duplication, however those
> > > metrics
> > > > > are not per connection,
> > > > > so they won't be that useful in practice for troubleshooting
> specific
> > > > > client instances.
> > > > >
> > > > > I'll add the request_type label to the relevant metrics.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> > > > > :
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > >
> > > > > > Would it make sense to include the request type as a label for
> the
> > > > > > `client.request.success`, `client.request.errors` and
> > > > > `client.request.rtt`
> > > > > > metrics? I think it would be very useful to see which specific
> > > requests
> > > > > are
> > > > > > succeeding and failing for a client. One specific case I can
> think
> > of
> > > > > where
> > > > > > this could be useful is producer batch timeouts. If a Java
> > > application
> > > > > does
> > > > > > not enable producer client logs (unfortunately, in my experience
> > this
> > > > > > happens more often than it should), the application logs will
> only
> > > > > contain
> > > > > > the expiration error message, but no information about what is
> > > causing
> > > > > the
> > > > > > timeout. The requests might all be succeeding but taking too 
long
> > to
> > > > > > process batches, or metadata requests might be failing, or some
> or
> > > all
> > > > > > produce requests might be failing (if the bootstrap servers are
> > > reachable
> > > > > > from the client but one or more other brokers are not, for
> > example).
> > > If
> > > > > the
> > > > > > cluster operator is able to identify the specific requests that
> are
> > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-01-27 Thread Jun Rao
Hi, Magnus,

Thanks for updating the KIP. The overall approach makes sense to me. A few
more detailed comments below.

20. ClientTelemetry: Should it be extending configurable and closable?

21. Compression of the metrics on the client: what's the default?

22. "Client metrics plugin / extending the MetricsReporter interface":
ClientTelemetry doesn't seem to extend MetricsReporter.

23. A client instance is considered a metric resource and the
resource-level (thus client instance level) labels could include:
client_software_name=confluent-kafka-python
client_software_version=v2.1.3
client_instance_id=B64CD139-3975-440A-91D4
transactional_id=someTxnApp
Are those labels added in PushTelemetryRequest? If so, are they per metric
or per request?

24.  "the broker will only send
GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
24.1 If it's always true, does it need to be part of the protocol?
24.2 Does delta only apply to Counter type?
24.3 In the delta representation, the first request needs to send the full
value, how does the broker plugin know whether a value is full or delta?

25. quota:
25.1 Since we are fitting PushTelemetryRequest into the existing request
quota, it would be useful to document the impact, i.e. client metric
throttling causes the data from the same client to be delayed.
25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
producer?
25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
the request/bandwidth quota is exceeded since those requests are not
rejected. We only set this error when the request is rejected (e.g., topic
creation). It would be useful to clarify when this error is used.

26. client-metrics entity:
26.1 It seems that we could add multiple entities that match to the same
client. Which one takes precedent?
26.2 How do we persist the new client metrics entities? Do we need to add
new ZK paths and new records in KRaft?

27. kafka-client-metrics.sh: Could we add an example on how to disable a
bad client?

28. New broker side metrics: Could we spell out the details of the metrics
(e.g., group, tags, etc)?

29. Client instance-level metrics: client.io.wait.time is a gauge not a
histogram.

Thanks,

Jun

On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill  wrote:

> Hi all,
>
> I've updated the KIP with responses to the latest comments: Java client
> dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
> producer, etc), etc.
>
> I will revive the vote thread.
>
> Thanks,
> Magnus
>
>
> Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan :
>
> > I think we should be very careful about introducing new runtime
> > dependencies into the clients. Historically this has been rare and
> > essentially necessary (e.g. compression libs).
> >
> > Ryanne
> >
> > On Mon, Dec 13, 2021, 1:06 PM Kirk True  wrote:
> >
> > > Hi Jun,
> > >
> > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > on OpenTelemetry library? How good is the compatibility story
> > > > of OpenTelemetry? This is important since an application could have
> > other
> > > > OpenTelemetry dependencies than the Kafka client.
> > >
> > > The current design is that the OpenTelemetry JARs would ship with the
> > > client. Perhaps we can design the client such that the JARs aren't even
> > > loaded if the user has opted out. The user could even exclude the JARs
> > from
> > > their dependencies if they so wished.
> > >
> > > I can't speak to the compatibility of the libraries. Is it possible
> that
> > > we include a shaded version?
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > 14. The proposal listed idempotence=true. This is more of a
> > configuration
> > > > than a metric. Are we including that as a metric? What other
> > > configurations
> > > > are we including? Should we separate the configurations from the
> > metrics?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill 
> > > wrote:
> > > >
> > > > > Hey Bob,
> > > > >
> > > > > That's a good point.
> > > > >
> > > > > Request type labels were considered but since they're already
> tracked
> > > by
> > > > > broker-side metrics
> > > > > they were left out as to avoid metric duplication, however those
> > > metrics
> > > > > are not per connection,
> > > > > so they won't be that useful in practice for troubleshooting
> specific
> > > > > client instances.
> > > > >
> > > > > I'll add the request_type label to the relevant metrics.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> > > > > :
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > >
> > > > > > Would it make sense to include the request type as a label for
> the
> > > > > > `client.request.success`, `client.request.errors` and
> > > > > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2022-01-26 Thread Magnus Edenhill
Hi all,

I've updated the KIP with responses to the latest comments: Java client
dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
producer, etc), etc.

I will revive the vote thread.

Thanks,
Magnus


Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan :

> I think we should be very careful about introducing new runtime
> dependencies into the clients. Historically this has been rare and
> essentially necessary (e.g. compression libs).
>
> Ryanne
>
> On Mon, Dec 13, 2021, 1:06 PM Kirk True  wrote:
>
> > Hi Jun,
> >
> > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > on OpenTelemetry library? How good is the compatibility story
> > > of OpenTelemetry? This is important since an application could have
> other
> > > OpenTelemetry dependencies than the Kafka client.
> >
> > The current design is that the OpenTelemetry JARs would ship with the
> > client. Perhaps we can design the client such that the JARs aren't even
> > loaded if the user has opted out. The user could even exclude the JARs
> from
> > their dependencies if they so wished.
> >
> > I can't speak to the compatibility of the libraries. Is it possible that
> > we include a shaded version?
> >
> > Thanks,
> > Kirk
> >
> > >
> > > 14. The proposal listed idempotence=true. This is more of a
> configuration
> > > than a metric. Are we including that as a metric? What other
> > configurations
> > > are we including? Should we separate the configurations from the
> metrics?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill 
> > wrote:
> > >
> > > > Hey Bob,
> > > >
> > > > That's a good point.
> > > >
> > > > Request type labels were considered but since they're already tracked
> > by
> > > > broker-side metrics
> > > > they were left out so as to avoid metric duplication; however, those
> > metrics
> > > > are not per connection,
> > > > so they won't be that useful in practice for troubleshooting specific
> > > > client instances.
> > > >
> > > > I'll add the request_type label to the relevant metrics.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> > > > :
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > Thanks for the thorough KIP, this seems very useful.
> > > > >
> > > > > Would it make sense to include the request type as a label for the
> > > > > `client.request.success`, `client.request.errors` and
> > > > `client.request.rtt`
> > > > > metrics? I think it would be very useful to see which specific
> > requests
> > > > are
> > > > > succeeding and failing for a client. One specific case I can think
> of
> > > > where
> > > > > this could be useful is producer batch timeouts. If a Java
> > application
> > > > does
> > > > > not enable producer client logs (unfortunately, in my experience
> this
> > > > > happens more often than it should), the application logs will only
> > > > contain
> > > > > the expiration error message, but no information about what is
> > causing
> > > > the
> > > > > timeout. The requests might all be succeeding but taking too long
> to
> > > > > process batches, or metadata requests might be failing, or some or
> > all
> > > > > produce requests might be failing (if the bootstrap servers are
> > reachable
> > > > > from the client but one or more other brokers are not, for
> example).
> > If
> > > > the
> > > > > cluster operator is able to identify the specific requests that are
> > slow
> > > > or
> > > > > failing for a client, they will be better able to diagnose the
> issue
> > > > > causing batch timeouts.
> > > > >
> > > > > One drawback I can think of is that this will increase the
> > cardinality of
> > > > > the request metrics. But any given client is only going to use a
> > small
> > > > > subset of the request types, and since we already have partition
> > labels
> > > > for
> > > > > the topic-level metrics, I think request labels will still make up
> a
> > > > > relatively small percentage of the set of metrics.
> > > > >
> > > > > Thanks,
> > > > > Bob
> > > > >
> > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > viktorsomo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > I think this is a very useful addition. We also have a similar
> (but
> > > > much
> > > > > > more simplistic) implementation of this. Maybe I missed it in the
> > KIP
> > > > but
> > > > > > what about adding metrics about the subscription cache itself?
> > That I
> > > > > think
> > > > > > would improve its usability and debuggability as we'd be able to
> > see
> > > > its
> > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > >
> > > > > > Best,
> > > > > > Viktor
> > > > > >
> > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> > mag...@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Mickael,
> > > > > > >
> > > > > > > see inline.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-12-13 Thread Ryanne Dolan
I think we should be very careful about introducing new runtime
dependencies into the clients. Historically this has been rare and
essentially necessary (e.g. compression libs).

Ryanne

On Mon, Dec 13, 2021, 1:06 PM Kirk True  wrote:

> Hi Jun,
>
> On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > 13. Using OpenTelemetry. Does that require runtime dependency
> > on OpenTelemetry library? How good is the compatibility story
> > of OpenTelemetry? This is important since an application could have other
> > OpenTelemetry dependencies than the Kafka client.
>
> The current design is that the OpenTelemetry JARs would ship with the
> client. Perhaps we can design the client such that the JARs aren't even
> loaded if the user has opted out. The user could even exclude the JARs from
> their dependencies if they so wished.
>
> I can't speak to the compatibility of the libraries. Is it possible that
> we include a shaded version?
>
> Thanks,
> Kirk
>
> >
> > 14. The proposal listed idempotence=true. This is more of a configuration
> > than a metric. Are we including that as a metric? What other
> configurations
> > are we including? Should we separate the configurations from the metrics?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill 
> wrote:
> >
> > > Hey Bob,
> > >
> > > That's a good point.
> > >
> > > Request type labels were considered but since they're already tracked
> by
> > > broker-side metrics
> > > they were left out so as to avoid metric duplication; however, those
> metrics
> > > are not per connection,
> > > so they won't be that useful in practice for troubleshooting specific
> > > client instances.
> > >
> > > I'll add the request_type label to the relevant metrics.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> > > :
> > >
> > > > Hi Magnus,
> > > >
> > > > Thanks for the thorough KIP, this seems very useful.
> > > >
> > > > Would it make sense to include the request type as a label for the
> > > > `client.request.success`, `client.request.errors` and
> > > `client.request.rtt`
> > > > metrics? I think it would be very useful to see which specific
> requests
> > > are
> > > > succeeding and failing for a client. One specific case I can think of
> > > where
> > > > this could be useful is producer batch timeouts. If a Java
> application
> > > does
> > > > not enable producer client logs (unfortunately, in my experience this
> > > > happens more often than it should), the application logs will only
> > > contain
> > > > the expiration error message, but no information about what is
> causing
> > > the
> > > > timeout. The requests might all be succeeding but taking too long to
> > > > process batches, or metadata requests might be failing, or some or
> all
> > > > produce requests might be failing (if the bootstrap servers are
> reachable
> > > > from the client but one or more other brokers are not, for example).
> If
> > > the
> > > > cluster operator is able to identify the specific requests that are
> slow
> > > or
> > > > failing for a client, they will be better able to diagnose the issue
> > > > causing batch timeouts.
> > > >
> > > > One drawback I can think of is that this will increase the
> cardinality of
> > > > the request metrics. But any given client is only going to use a
> small
> > > > subset of the request types, and since we already have partition
> labels
> > > for
> > > > the topic-level metrics, I think request labels will still make up a
> > > > relatively small percentage of the set of metrics.
> > > >
> > > > Thanks,
> > > > Bob
> > > >
> > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > viktorsomo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I think this is a very useful addition. We also have a similar (but
> > > much
> > > > > more simplistic) implementation of this. Maybe I missed it in the
> KIP
> > > but
> > > > > what about adding metrics about the subscription cache itself?
> That I
> > > > think
> > > > > would improve its usability and debuggability as we'd be able to
> see
> > > its
> > > > > performance, hit/miss rates, eviction counts and others.
> > > > >
> > > > > Best,
> > > > > Viktor
> > > > >
> > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> mag...@edenhill.se>
> > > > > wrote:
> > > > >
> > > > > > Hi Mickael,
> > > > > >
> > > > > > see inline.
> > > > > >
> > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > > > > > mickael.mai...@gmail.com
> > > > > > >:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > I see you've addressed some of the points I raised above but
> some
> > > (4,
> > > > > > > 5) have not been addressed yet.
> > > > > > >
> > > > > >
> > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > >
> > > > > > One possibility is to add a JMX metric (thus for user
> consumption)
> > > for
> > > > > the
> > > > > > number of metric pushes the

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-12-13 Thread Kirk True
Hi Jun,

On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> 13. Using OpenTelemetry. Does that require runtime dependency
> on OpenTelemetry library? How good is the compatibility story
> of OpenTelemetry? This is important since an application could have other
> OpenTelemetry dependencies than the Kafka client.

The current design is that the OpenTelemetry JARs would ship with the client. 
Perhaps we can design the client such that the JARs aren't even loaded if the 
user has opted out. The user could even exclude the JARs from their 
dependencies if they so wished.
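One way this opt-out could be made to work — a sketch only, not the KIP's actual mechanism, and the probed class name is just an illustrative OpenTelemetry entry point — is to probe for the classes reflectively without initializing them, so excluding the JARs cleanly disables telemetry:

```java
public class TelemetrySupport {

    // Probe for a representative class without initializing it (initialize=false).
    // The class name is illustrative; a real client would probe whichever
    // OpenTelemetry entry point it actually uses.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, TelemetrySupport.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Telemetry runs only if the user has not opted out AND the JARs are
    // actually loadable, so stripping them from the dependency tree is enough.
    public static boolean telemetryEnabled(boolean userOptedOut) {
        return !userOptedOut
                && isClassPresent("io.opentelemetry.proto.metrics.v1.MetricsData");
    }

    public static void main(String[] args) {
        System.out.println("telemetry enabled: " + telemetryEnabled(false));
    }
}
```

With this shape the JARs are never touched when the user opts out, which also sidesteps most version-clash concerns at runtime.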

I can't speak to the compatibility of the libraries. Is it possible that we 
include a shaded version?

Thanks,
Kirk

> 
> 14. The proposal listed idempotence=true. This is more of a configuration
> than a metric. Are we including that as a metric? What other configurations
> are we including? Should we separate the configurations from the metrics?
> 
> Thanks,
> 
> Jun
> 
> On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill  wrote:
> 
> > Hey Bob,
> >
> > That's a good point.
> >
> > Request type labels were considered but since they're already tracked by
> > broker-side metrics
> > they were left out so as to avoid metric duplication; however, those metrics
> > are not per connection,
> > so they won't be that useful in practice for troubleshooting specific
> > client instances.
> >
> > I'll add the request_type label to the relevant metrics.
> >
> > Thanks,
> > Magnus
> >
> >
> > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> > :
> >
> > > Hi Magnus,
> > >
> > > Thanks for the thorough KIP, this seems very useful.
> > >
> > > Would it make sense to include the request type as a label for the
> > > `client.request.success`, `client.request.errors` and
> > `client.request.rtt`
> > > metrics? I think it would be very useful to see which specific requests
> > are
> > > succeeding and failing for a client. One specific case I can think of
> > where
> > > this could be useful is producer batch timeouts. If a Java application
> > does
> > > not enable producer client logs (unfortunately, in my experience this
> > > happens more often than it should), the application logs will only
> > contain
> > > the expiration error message, but no information about what is causing
> > the
> > > timeout. The requests might all be succeeding but taking too long to
> > > process batches, or metadata requests might be failing, or some or all
> > > produce requests might be failing (if the bootstrap servers are reachable
> > > from the client but one or more other brokers are not, for example). If
> > the
> > > cluster operator is able to identify the specific requests that are slow
> > or
> > > failing for a client, they will be better able to diagnose the issue
> > > causing batch timeouts.
> > >
> > > One drawback I can think of is that this will increase the cardinality of
> > > the request metrics. But any given client is only going to use a small
> > > subset of the request types, and since we already have partition labels
> > for
> > > the topic-level metrics, I think request labels will still make up a
> > > relatively small percentage of the set of metrics.
> > >
> > > Thanks,
> > > Bob
> > >
> > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > viktorsomo...@gmail.com>
> > > wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > I think this is a very useful addition. We also have a similar (but
> > much
> > > > more simplistic) implementation of this. Maybe I missed it in the KIP
> > but
> > > > what about adding metrics about the subscription cache itself? That I
> > > think
> > > > would improve its usability and debuggability as we'd be able to see
> > its
> > > > performance, hit/miss rates, eviction counts and others.
> > > >
> > > > Best,
> > > > Viktor
> > > >
> > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill 
> > > > wrote:
> > > >
> > > > > Hi Mickael,
> > > > >
> > > > > see inline.
> > > > >
> > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > > > > mickael.mai...@gmail.com
> > > > > >:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > I see you've addressed some of the points I raised above but some
> > (4,
> > > > > > 5) have not been addressed yet.
> > > > > >
> > > > >
> > > > > Re 4) How will the user/app know metrics are being sent.
> > > > >
> > > > > One possibility is to add a JMX metric (thus for user consumption)
> > for
> > > > the
> > > > > number of metric pushes the
> > > > > client has performed, or perhaps the number of metrics subscriptions
> > > > > currently being collected.
> > > > > Would that be sufficient?
> > > > >
> > > > > Re 5) Metric sizes and rates
> > > > >
> > > > > A worst case scenario for a producer that is producing to 50 unique
> > > > topics
> > > > > and emitting all standard metrics yields
> > > > > a serialized size of around 100KB prior to compression, which
> > > compresses
> > > > > down to about 20-30% of that depending
> > > > > on compression type and topic name uniqueness.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-12-09 Thread Jun Rao
Hi, Magnus.

Thanks for the KIP. A few comments below.

10. There seems to be some questions on the use cases of this KIP since we
already have a client side metric reporter. It would be useful to provide a
bit more details on that. To me, there are 3 potential use cases: (1) not
all organizations are enforcing client side metric collections; (2) if the
data is shared among 3rd parties, there is less control over external
clients; (3) when Kafka is offered as a hosted service. It would also be
useful to outline the client problems this KIP can help identify. For
example, this KIP may not help with any client connectivity problems.

11. Have we considered sending the metrics with the existing produce
request to an internal topic instead of a new request PushTelemetryRequest?
The potential benefits are (1) reusing existing request's support on
compression, throughput throttling, etc and (2) we could potentially get
rid of ClientTelemetryReceiver. Once the metrics land in a Kafka topic, the
operator can decide what to do with it by just consuming the topic.

12. It seems that we are defining a set of common metric names that every
client needs to support. Are most non-Java clients following the naming
convention of the Java client metrics? If not, forcing them all to change
their metric names could be destructive.

13. Using OpenTelemetry. Does that require a runtime dependency
on the OpenTelemetry library? How good is the compatibility story
of OpenTelemetry? This is important since an application could have
OpenTelemetry dependencies other than the Kafka client's.

14. The proposal listed idempotence=true. This is more of a configuration
than a metric. Are we including that as a metric? What other configurations
are we including? Should we separate the configurations from the metrics?
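Jun's distinction can be made concrete with a toy partitioning of a flat pushed payload into configuration-style attributes and numeric metrics — purely illustrative, not the KIP's wire format or its actual field names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PayloadSplit {

    // Split a flat key/value payload: numeric values are treated as metrics,
    // everything else (e.g. idempotence=true) as configuration attributes.
    public static Map<String, Map<String, String>> split(Map<String, String> payload) {
        Map<String, String> metrics = new LinkedHashMap<>();
        Map<String, String> configs = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : payload.entrySet()) {
            if (isNumeric(e.getValue()))
                metrics.put(e.getKey(), e.getValue());
            else
                configs.put(e.getKey(), e.getValue());
        }
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        out.put("metrics", metrics);
        out.put("configs", configs);
        return out;
    }

    private static boolean isNumeric(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Keeping the two sets apart lets configurations be reported once as static attributes rather than resampled on every push interval.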

Thanks,

Jun

On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill  wrote:

> Hey Bob,
>
> That's a good point.
>
> Request type labels were considered but since they're already tracked by
> broker-side metrics
> they were left out so as to avoid metric duplication; however, those metrics
> are not per connection,
> so they won't be that useful in practice for troubleshooting specific
> client instances.
>
> I'll add the request_type label to the relevant metrics.
>
> Thanks,
> Magnus
>
>
> Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> :
>
> > Hi Magnus,
> >
> > Thanks for the thorough KIP, this seems very useful.
> >
> > Would it make sense to include the request type as a label for the
> > `client.request.success`, `client.request.errors` and
> `client.request.rtt`
> > metrics? I think it would be very useful to see which specific requests
> are
> > succeeding and failing for a client. One specific case I can think of
> where
> > this could be useful is producer batch timeouts. If a Java application
> does
> > not enable producer client logs (unfortunately, in my experience this
> > happens more often than it should), the application logs will only
> contain
> > the expiration error message, but no information about what is causing
> the
> > timeout. The requests might all be succeeding but taking too long to
> > process batches, or metadata requests might be failing, or some or all
> > produce requests might be failing (if the bootstrap servers are reachable
> > from the client but one or more other brokers are not, for example). If
> the
> > cluster operator is able to identify the specific requests that are slow
> or
> > failing for a client, they will be better able to diagnose the issue
> > causing batch timeouts.
> >
> > One drawback I can think of is that this will increase the cardinality of
> > the request metrics. But any given client is only going to use a small
> > subset of the request types, and since we already have partition labels
> for
> > the topic-level metrics, I think request labels will still make up a
> > relatively small percentage of the set of metrics.
> >
> > Thanks,
> > Bob
> >
> > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > viktorsomo...@gmail.com>
> > wrote:
> >
> > > Hi Magnus,
> > >
> > > I think this is a very useful addition. We also have a similar (but
> much
> > > more simplistic) implementation of this. Maybe I missed it in the KIP
> but
> > > what about adding metrics about the subscription cache itself? That I
> > think
> > > would improve its usability and debuggability as we'd be able to see
> its
> > > performance, hit/miss rates, eviction counts and others.
> > >
> > > Best,
> > > Viktor
> > >
> > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill 
> > > wrote:
> > >
> > > > Hi Mickael,
> > > >
> > > > see inline.
> > > >
> > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > > > mickael.mai...@gmail.com
> > > > >:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I see you've addressed some of the points I raised above but some
> (4,
> > > > > 5) have not been addressed yet.
> > > > >
> > > >
> > > > Re 4) How will the user/app know metrics are being sent.
> > > >
> > > > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-29 Thread Magnus Edenhill
Hey Bob,

That's a good point.

Request type labels were considered but since they're already tracked by
broker-side metrics
they were left out so as to avoid metric duplication; however, those metrics
are not per connection,
so they won't be that useful in practice for troubleshooting specific
client instances.

I'll add the request_type label to the relevant metrics.
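A toy model of what the label buys — illustrative only, not the client's actual metrics machinery — is that each labelled request type becomes its own counter series:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: one counter series per (metric name, request_type)
// pair, which is what adding the label effectively means.
public class LabeledCounters {
    private final Map<String, LongAdder> series = new ConcurrentHashMap<>();

    private static String key(String metric, String requestType) {
        return metric + "{request_type=" + requestType + "}";
    }

    public void inc(String metric, String requestType) {
        series.computeIfAbsent(key(metric, requestType), k -> new LongAdder())
                .increment();
    }

    public long get(String metric, String requestType) {
        LongAdder a = series.get(key(metric, requestType));
        return a == null ? 0 : a.sum();
    }
}
```

This also shows why the cardinality stays bounded in practice: series are only created for request types the client actually issues.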

Thanks,
Magnus


On Tue, 23 Nov 2021 at 19:20, Bob Barrett wrote:

> Hi Magnus,
>
> Thanks for the thorough KIP, this seems very useful.
>
> Would it make sense to include the request type as a label for the
> `client.request.success`, `client.request.errors` and `client.request.rtt`
> metrics? I think it would be very useful to see which specific requests are
> succeeding and failing for a client. One specific case I can think of where
> this could be useful is producer batch timeouts. If a Java application does
> not enable producer client logs (unfortunately, in my experience this
> happens more often than it should), the application logs will only contain
> the expiration error message, but no information about what is causing the
> timeout. The requests might all be succeeding but taking too long to
> process batches, or metadata requests might be failing, or some or all
> produce requests might be failing (if the bootstrap servers are reachable
> from the client but one or more other brokers are not, for example). If the
> cluster operator is able to identify the specific requests that are slow or
> failing for a client, they will be better able to diagnose the issue
> causing batch timeouts.
>
> One drawback I can think of is that this will increase the cardinality of
> the request metrics. But any given client is only going to use a small
> subset of the request types, and since we already have partition labels for
> the topic-level metrics, I think request labels will still make up a
> relatively small percentage of the set of metrics.
>
> Thanks,
> Bob
>
> On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> viktorsomo...@gmail.com>
> wrote:
>
> > Hi Magnus,
> >
> > I think this is a very useful addition. We also have a similar (but much
> > more simplistic) implementation of this. Maybe I missed it in the KIP but
> > what about adding metrics about the subscription cache itself? That I
> think
> > would improve its usability and debuggability as we'd be able to see its
> > performance, hit/miss rates, eviction counts and others.
> >
> > Best,
> > Viktor
> >
> > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill 
> > wrote:
> >
> > > Hi Mickael,
> > >
> > > see inline.
> > >
> > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > > mickael.mai...@gmail.com
> > > >:
> > >
> > > > Hi Magnus,
> > > >
> > > > I see you've addressed some of the points I raised above but some (4,
> > > > 5) have not been addressed yet.
> > > >
> > >
> > > Re 4) How will the user/app know metrics are being sent.
> > >
> > > One possibility is to add a JMX metric (thus for user consumption) for
> > the
> > > number of metric pushes the
> > > client has performed, or perhaps the number of metrics subscriptions
> > > currently being collected.
> > > Would that be sufficient?
> > >
> > > Re 5) Metric sizes and rates
> > >
> > > A worst case scenario for a producer that is producing to 50 unique
> > topics
> > > and emitting all standard metrics yields
> > > a serialized size of around 100KB prior to compression, which
> compresses
> > > down to about 20-30% of that depending
> > > on compression type and topic name uniqueness.
> > > The numbers for a consumer would be similar.
> > >
> > > In practice the number of unique topics would be far less, and the
> > > subscription set would typically be for a subset of metrics.
> > > So we're probably closer to 1kb, or less, compressed size per client
> per
> > > push interval.
> > >
> > > As both the subscription set and push intervals are controlled by the
> > > cluster operator it shouldn't be too hard
> > > to strike a good balance between metrics overhead and granularity.
> > >
> > >
> > >
> > > >
> > > > I'm really uneasy with this being enabled by default on the client
> > > > side. When collecting data, I think the best practice is to ensure
> > > > users are explicitly enabling it.
> > > >
> > >
> > > Requiring metrics to be explicitly enabled on clients severely cripples
> > its
> > > usability and value.
> > >
> > > One of the problems that this KIP aims to solve is for useful metrics
> to
> > be
> > > available on demand
> > > > > > regardless of the technical expertise of the user. As Ryanne points
> out,
> > a
> > > savvy user/organization
> > > will typically have metrics collection and monitoring in place already,
> > and
> > > the benefits of this KIP
> > > are then more of a common set and format of metrics across client
> > > implementations and languages.
> > > But that is not the typical Kafka user in my experience, they're not
> > Kafka
> > > experts and they don't have the
> > > knowledge of how to best instrument their clients.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-29 Thread Magnus Edenhill
Hi Viktor,

that's a good idea, I've added a bunch of broker-side metrics for the
client metrics handling.
There might be more added during development as the need arise.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability#KIP714:Clientmetricsandobservability-Newbrokermetrics
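Cache instrumentation of the kind Viktor asks about could look like the following sketch — a hypothetical LRU wrapper with hit/miss/eviction counters, not the actual broker implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an access-ordered LRU cache that counts hits, misses
// and evictions, the kind of numbers a subscription cache could export.
public class InstrumentedCache<K, V> {
    private long hits, misses, evictions;
    private final Map<K, V> map;

    public InstrumentedCache(int maxEntries) {
        // Access-ordered LinkedHashMap evicts the least recently used entry
        // once the cache grows past maxEntries.
        map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                boolean evict = size() > maxEntries;
                if (evict)
                    evictions++;
                return evict;
            }
        };
    }

    public V get(K key) {
        V value = map.get(key);
        if (value == null) misses++; else hits++;
        return value;
    }

    public void put(K key, V value) {
        map.put(key, value);
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
    public long evictions() { return evictions; }
}
```

Exporting those three counters (plus size) would cover the hit/miss-rate and eviction-count visibility Viktor mentions.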

Thanks,
Magnus

On Mon, 22 Nov 2021 at 11:08, Viktor Somogyi-Vass <
viktorsomo...@gmail.com> wrote:

> Hi Magnus,
>
> I think this is a very useful addition. We also have a similar (but much
> more simplistic) implementation of this. Maybe I missed it in the KIP but
> what about adding metrics about the subscription cache itself? That I think
> would improve its usability and debuggability as we'd be able to see its
> performance, hit/miss rates, eviction counts and others.
>
> Best,
> Viktor
>
> On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill 
> wrote:
>
> > Hi Mickael,
> >
> > see inline.
> >
> > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > mickael.mai...@gmail.com
> > >:
> >
> > > Hi Magnus,
> > >
> > > I see you've addressed some of the points I raised above but some (4,
> > > 5) have not been addressed yet.
> > >
> >
> > Re 4) How will the user/app know metrics are being sent.
> >
> > One possibility is to add a JMX metric (thus for user consumption) for
> the
> > number of metric pushes the
> > client has performed, or perhaps the number of metrics subscriptions
> > currently being collected.
> > Would that be sufficient?
> >
> > Re 5) Metric sizes and rates
> >
> > A worst case scenario for a producer that is producing to 50 unique
> topics
> > and emitting all standard metrics yields
> > a serialized size of around 100KB prior to compression, which compresses
> > down to about 20-30% of that depending
> > on compression type and topic name uniqueness.
> > The numbers for a consumer would be similar.
> >
> > In practice the number of unique topics would be far less, and the
> > subscription set would typically be for a subset of metrics.
> > So we're probably closer to 1kb, or less, compressed size per client per
> > push interval.
> >
> > As both the subscription set and push intervals are controlled by the
> > cluster operator it shouldn't be too hard
> > to strike a good balance between metrics overhead and granularity.
> >
> >
> >
> > >
> > > I'm really uneasy with this being enabled by default on the client
> > > side. When collecting data, I think the best practice is to ensure
> > > users are explicitly enabling it.
> > >
> >
> > Requiring metrics to be explicitly enabled on clients severely cripples
> its
> > usability and value.
> >
> > One of the problems that this KIP aims to solve is for useful metrics to
> be
> > available on demand
> > regardless of the technical expertise of the user. As Ryanne points out,
> a
> > savvy user/organization
> > will typically have metrics collection and monitoring in place already,
> and
> > the benefits of this KIP
> > are then more of a common set and format of metrics across client
> > implementations and languages.
> > But that is not the typical Kafka user in my experience, they're not
> Kafka
> > experts and they don't have the
> > knowledge of how to best instrument their clients.
> > Having metrics enabled by default for this user base allows the Kafka
> > operators to proactively and reactively
> > monitor and troubleshoot client issues, without the need for the less
> savvy
> > user to do anything.
> > It is often too late to tell a user to enable metrics when the problem
> has
> > already occurred.
> >
> > Now, to be clear, even though metrics are enabled by default on clients
> it
> > is not enabled by default
> > on the brokers; the Kafka operator needs to build and set up a metrics
> > plugin and add metrics subscriptions
> > before anything is sent from the client.
> > It is opt-out on the clients and opt-in on the broker.
> >
> >
> >
> >
> > > You mentioned brokers already have
> > > some(most?) of the information contained in metrics, if so then why
> > > are we collecting it again? Surely there must be some new information
> > > in the client metrics.
> > >
> >
> > From the user's perspective the Kafka infrastructure extends from
> > producer.send() to
> > messages being returned from consumer.poll(), a giant black box where
> > there's a lot going on between those
> > two points. The brokers currently only see what happens once those
> requests
> > and messages hits the broker,
> > but as Kafka clients are complex pieces of machinery there's a myriad of
> > queues, timers, and state
> > that's critical to the operation and infrastructure that's not currently
> > visible to the operator.
> > Relying on the user to accurately and timely provide this missing
> > information is not generally feasible.
> >
> >
> > Most of the standard metrics listed in the KIP are data points that the
> > broker does not have.
> > Only a small number of metrics are duplicates (like the request counts
> and
> > sizes), but they are 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-23 Thread Bob Barrett
Hi Magnus,

Thanks for the thorough KIP, this seems very useful.

Would it make sense to include the request type as a label for the
`client.request.success`, `client.request.errors` and `client.request.rtt`
metrics? I think it would be very useful to see which specific requests are
succeeding and failing for a client. One specific case I can think of where
this could be useful is producer batch timeouts. If a Java application does
not enable producer client logs (unfortunately, in my experience this
happens more often than it should), the application logs will only contain
the expiration error message, but no information about what is causing the
timeout. The requests might all be succeeding but taking too long to
process batches, or metadata requests might be failing, or some or all
produce requests might be failing (if the bootstrap servers are reachable
from the client but one or more other brokers are not, for example). If the
cluster operator is able to identify the specific requests that are slow or
failing for a client, they will be better able to diagnose the issue
causing batch timeouts.

One drawback I can think of is that this will increase the cardinality of
the request metrics. But any given client is only going to use a small
subset of the request types, and since we already have partition labels for
the topic-level metrics, I think request labels will still make up a
relatively small percentage of the set of metrics.
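The cardinality argument can be made concrete with rough arithmetic — the numbers below are illustrative assumptions, not measurements:

```java
public class CardinalityEstimate {

    // Series count if the request metrics carry no request_type label.
    public static long unlabeled(long clients, long requestMetrics) {
        return clients * requestMetrics;
    }

    // Series count if each request metric fans out by the request types
    // a client actually uses.
    public static long labeled(long clients, long requestMetrics, long typesPerClient) {
        return clients * requestMetrics * typesPerClient;
    }

    public static void main(String[] args) {
        // Illustrative assumptions: 1000 clients, the 3 request metrics under
        // discussion, and ~8 request types used per client.
        System.out.println(unlabeled(1000, 3));   // prints 3000
        System.out.println(labeled(1000, 3, 8));  // prints 24000
    }
}
```

So the label multiplies the request-metric series by the handful of request types each client uses, which stays small next to per-partition topic-level series.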

Thanks,
Bob

On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass 
wrote:

> Hi Magnus,
>
> I think this is a very useful addition. We also have a similar (but much
> more simplistic) implementation of this. Maybe I missed it in the KIP but
> what about adding metrics about the subscription cache itself? That I think
> would improve its usability and debuggability as we'd be able to see its
> performance, hit/miss rates, eviction counts and others.
>
> Best,
> Viktor
>
> On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill 
> wrote:
>
> > Hi Mickael,
> >
> > see inline.
> >
> > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > mickael.mai...@gmail.com
> > >:
> >
> > > Hi Magnus,
> > >
> > > I see you've addressed some of the points I raised above but some (4,
> > > 5) have not been addressed yet.
> > >
> >
> > Re 4) How will the user/app know metrics are being sent.
> >
> > One possibility is to add a JMX metric (thus for user consumption) for
> the
> > number of metric pushes the
> > client has performed, or perhaps the number of metrics subscriptions
> > currently being collected.
> > Would that be sufficient?
> >
> > Re 5) Metric sizes and rates
> >
> > A worst case scenario for a producer that is producing to 50 unique
> topics
> > and emitting all standard metrics yields
> > a serialized size of around 100KB prior to compression, which compresses
> > down to about 20-30% of that depending
> > on compression type and topic name uniqueness.
> > The numbers for a consumer would be similar.
> >
> > In practice the number of unique topics would be far less, and the
> > subscription set would typically be for a subset of metrics.
> > So we're probably closer to 1kb, or less, compressed size per client per
> > push interval.
> >
> > As both the subscription set and push intervals are controlled by the
> > cluster operator it shouldn't be too hard
> > to strike a good balance between metrics overhead and granularity.
> >
> >
> >
> > >
> > > I'm really uneasy with this being enabled by default on the client
> > > side. When collecting data, I think the best practice is to ensure
> > > users are explicitly enabling it.
> > >
> >
> > Requiring metrics to be explicitly enabled on clients severely cripples
> its
> > usability and value.
> >
> > One of the problems that this KIP aims to solve is for useful metrics to
> be
> > available on demand
> > regardless of the technical expertise of the user. As Ryanne points out,
> > a savvy user/organization will typically have metrics collection and
> > monitoring in place already, and the benefits of this KIP are then more
> > of a common set and format of metrics across client implementations and
> > languages.
> > But that is not the typical Kafka user in my experience, they're not
> Kafka
> > experts and they don't have the
> > knowledge of how to best instrument their clients.
> > Having metrics enabled by default for this user base allows the Kafka
> > operators to proactively and reactively
> > monitor and troubleshoot client issues, without the need for the less
> savvy
> > user to do anything.
> > It is often too late to tell a user to enable metrics when the problem
> has
> > already occurred.
> >
> > Now, to be clear, even though metrics are enabled by default on clients,
> > they are not enabled by default on the brokers; the Kafka operator needs
> > to build and set up a metrics plugin and add metrics subscriptions before
> > anything is sent from the client.
> > It is opt-out on the clients and opt-in on the broker.
> >

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-22 Thread Viktor Somogyi-Vass
Hi Magnus,

I think this is a very useful addition. We also have a similar (but much
more simplistic) implementation of this. Maybe I missed it in the KIP but
what about adding metrics about the subscription cache itself? That I think
would improve its usability and debuggability as we'd be able to see its
performance, hit/miss rates, eviction counts and others.
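[Editor's note] For illustration, a minimal broker-side subscription cache that exposes the statistics Viktor asks about — names, structure, and the LRU policy here are all assumptions, not taken from the KIP:

```python
from collections import OrderedDict

class SubscriptionCache:
    """Tiny LRU cache that tracks hit/miss/eviction counts as metrics."""
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: OrderedDict = OrderedDict()
        self.hits = self.misses = self.evictions = 0

    def get(self, client_instance_id: str):
        if client_instance_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(client_instance_id)  # mark recently used
            return self.entries[client_instance_id]
        self.misses += 1
        return None

    def put(self, client_instance_id: str, subscription) -> None:
        self.entries[client_instance_id] = subscription
        self.entries.move_to_end(client_instance_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
            self.evictions += 1

cache = SubscriptionCache(capacity=2)
cache.put("client-a", {"metrics": ["*"]})
cache.put("client-b", {"metrics": ["client.producer.*"]})
assert cache.get("client-a") is not None   # hit
assert cache.get("client-c") is None       # miss
cache.put("client-c", {"metrics": []})     # evicts client-b (LRU)
assert (cache.hits, cache.misses, cache.evictions) == (1, 1, 1)
```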

Best,
Viktor

On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill  wrote:

> Hi Mickael,
>
> see inline.
>
> Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> mickael.mai...@gmail.com
> >:
>
> > Hi Magnus,
> >
> > I see you've addressed some of the points I raised above but some (4,
> > 5) have not been addressed yet.
> >
>
> Re 4) How will the user/app know metrics are being sent.
>
> One possibility is to add a JMX metric (thus for user consumption) for the
> number of metric pushes the
> client has performed, or perhaps the number of metrics subscriptions
> currently being collected.
> Would that be sufficient?
>
> Re 5) Metric sizes and rates
>
> A worst case scenario for a producer that is producing to 50 unique topics
> and emitting all standard metrics yields
> a serialized size of around 100KB prior to compression, which compresses
> down to about 20-30% of that depending
> on compression type and topic name uniqueness.
> The numbers for a consumer would be similar.
>
> In practice the number of unique topics would be far less, and the
> subscription set would typically be for a subset of metrics.
> So we're probably closer to 1kb, or less, compressed size per client per
> push interval.
>
> As both the subscription set and push intervals are controlled by the
> cluster operator it shouldn't be too hard
> to strike a good balance between metrics overhead and granularity.
>
>
>
> >
> > I'm really uneasy with this being enabled by default on the client
> > side. When collecting data, I think the best practice is to ensure
> > users are explicitly enabling it.
> >
>
> Requiring metrics to be explicitly enabled on clients severely cripples its
> usability and value.
>
> One of the problems that this KIP aims to solve is for useful metrics to be
> available on demand
> regardless of the technical expertise of the user. As Ryanne points out, a
> savvy user/organization will typically have metrics collection and
> monitoring in place already, and the benefits of this KIP are then more of
> a common set and format of metrics across client implementations and
> languages.
> But that is not the typical Kafka user in my experience, they're not Kafka
> experts and they don't have the
> knowledge of how to best instrument their clients.
> Having metrics enabled by default for this user base allows the Kafka
> operators to proactively and reactively
> monitor and troubleshoot client issues, without the need for the less savvy
> user to do anything.
> It is often too late to tell a user to enable metrics when the problem has
> already occurred.
>
> Now, to be clear, even though metrics are enabled by default on clients,
> they are not enabled by default on the brokers; the Kafka operator needs to
> build and set up a metrics plugin and add metrics subscriptions before
> anything is sent from the client.
> It is opt-out on the clients and opt-in on the broker.
>
>
>
>
> > You mentioned brokers already have
> > some (most?) of the information contained in metrics; if so, then why
> > are we collecting it again? Surely there must be some new information
> > in the client metrics.
> >
>
> From the user's perspective the Kafka infrastructure extends from
> producer.send() to messages being returned from consumer.poll(), a giant
> black box with a lot going on between those two points. The brokers
> currently only see what happens once those requests and messages hit the
> broker, but Kafka clients are complex pieces of machinery with a myriad of
> queues, timers, and state that is critical to their operation yet not
> currently visible to the operator.
> Relying on the user to provide this missing information accurately and in
> a timely manner is not generally feasible.
>
>
> Most of the standard metrics listed in the KIP are data points that the
> broker does not have.
> Only a small number of metrics are duplicates (like the request counts and
> sizes), but they are included
> to ease correlation when inspecting these client metrics.
>
>
>
> > Moreover this is a brand new feature so it's even harder to justify
> > enabling it and forcing onto all our users. If disabled by default,
> > it's relatively easy to enable in a new release if we decide to, but
> > once enabled by default it's much harder to disable. Also this feature
> > will apply to all future metrics we will add.
> >
>
> I think the maturity of the feature's implementation should be the
> deciding factor, rather than its design (which is what this KIP covers).
> I.e., if the implementation is not deemed mature enough for release X.Y it
> will be disabled.
>
>
>
> > Overall I think 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-18 Thread Magnus Edenhill
Hi Mickael,

see inline.

On Wed, 10 Nov 2021 at 15:21, Mickael Maison wrote:

> Hi Magnus,
>
> I see you've addressed some of the points I raised above but some (4,
> 5) have not been addressed yet.
>

Re 4) How will the user/app know metrics are being sent.

One possibility is to add a JMX metric (thus for user consumption) for the
number of metric pushes the
client has performed, or perhaps the number of metrics subscriptions
currently being collected.
Would that be sufficient?

Re 5) Metric sizes and rates

A worst case scenario for a producer that is producing to 50 unique topics
and emitting all standard metrics yields
a serialized size of around 100KB prior to compression, which compresses
down to about 20-30% of that depending
on compression type and topic name uniqueness.
The numbers for a consumer would be similar.

In practice the number of unique topics would be far less, and the
subscription set would typically be for a subset of metrics.
So we're probably closer to 1 KB, or less, of compressed size per client per
push interval.

As both the subscription set and push intervals are controlled by the
cluster operator it shouldn't be too hard
to strike a good balance between metrics overhead and granularity.
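[Editor's note] The compression figure quoted above is easy to sanity-check: serialized metrics are highly repetitive (the same metric names repeated per topic), which generic compressors exploit well. A rough, purely illustrative model using synthetic metric lines (the payload format below is invented for the sketch, not the KIP's wire format):

```python
import zlib

# Build a synthetic "serialized metrics" payload for 50 topics: repeated
# metric names with per-topic labels, loosely mimicking the worst case above.
metric_names = [f"org.apache.kafka/client.producer.partition.metric{i}"
                for i in range(40)]
lines = []
for t in range(50):
    for name in metric_names:
        lines.append(f'{name}{{topic="topic-{t}",partition="0"}} 123456.789')
payload = "\n".join(lines).encode()

compressed = zlib.compress(payload, level=6)
ratio = len(compressed) / len(payload)
print(f"raw={len(payload) / 1024:.0f}KB "
      f"compressed={len(compressed) / 1024:.0f}KB ratio={ratio:.0%}")

# Repetitive metric payloads compress far better than generic data,
# consistent with the 20-30% (or better) figure cited in the thread.
assert ratio < 0.35
```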



>
> I'm really uneasy with this being enabled by default on the client
> side. When collecting data, I think the best practice is to ensure
> users are explicitly enabling it.
>

Requiring metrics to be explicitly enabled on clients severely cripples its
usability and value.

One of the problems that this KIP aims to solve is for useful metrics to be
available on demand
regardless of the technical expertise of the user. As Ryanne points out, a
savvy user/organization will typically have metrics collection and monitoring
in place already, and the benefits of this KIP are then more of a common set
and format of metrics across client implementations and languages.
But that is not the typical Kafka user in my experience, they're not Kafka
experts and they don't have the
knowledge of how to best instrument their clients.
Having metrics enabled by default for this user base allows the Kafka
operators to proactively and reactively
monitor and troubleshoot client issues, without the need for the less savvy
user to do anything.
It is often too late to tell a user to enable metrics when the problem has
already occurred.

Now, to be clear, even though metrics are enabled by default on clients,
they are not enabled by default on the brokers; the Kafka operator needs to
build and set up a metrics plugin and add metrics subscriptions before
anything is sent from the client.
It is opt-out on the clients and opt-in on the broker.
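[Editor's note] This opt-out/opt-in interaction can be modelled as two independent switches — a client-side default-on flag and the broker-side subscription list — where metrics flow only when both allow it (the flag name and the shape of the model are illustrative, not final KIP names):

```python
def metrics_flow(client_enable_push: bool, broker_subscriptions: list) -> bool:
    """Metrics are pushed only if the client has not opted out AND the
    operator has opted in by configuring at least one subscription."""
    return client_enable_push and len(broker_subscriptions) > 0

# Out of the box: client defaults to on, but no broker plugin/subscriptions.
assert metrics_flow(True, []) is False             # nothing is sent

# Operator installs a metrics plugin and adds a subscription: opt-in.
assert metrics_flow(True, ["client.producer.*"]) is True

# A privacy-sensitive user opts out on the client side regardless.
assert metrics_flow(False, ["client.producer.*"]) is False
```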




> You mentioned brokers already have
> some (most?) of the information contained in metrics; if so, then why
> are we collecting it again? Surely there must be some new information
> in the client metrics.
>

From the user's perspective the Kafka infrastructure extends from
producer.send() to messages being returned from consumer.poll(), a giant
black box with a lot going on between those two points. The brokers
currently only see what happens once those requests and messages hit the
broker, but Kafka clients are complex pieces of machinery with a myriad of
queues, timers, and state that is critical to their operation yet not
currently visible to the operator.
Relying on the user to provide this missing information accurately and in a
timely manner is not generally feasible.


Most of the standard metrics listed in the KIP are data points that the
broker does not have.
Only a small number of metrics are duplicates (like the request counts and
sizes), but they are included
to ease correlation when inspecting these client metrics.



> Moreover this is a brand new feature so it's even harder to justify
> enabling it and forcing onto all our users. If disabled by default,
> it's relatively easy to enable in a new release if we decide to, but
> once enabled by default it's much harder to disable. Also this feature
> will apply to all future metrics we will add.
>

I think the maturity of the feature's implementation should be the deciding
factor, rather than its design (which is what this KIP covers). I.e., if the
implementation is not deemed mature enough for release X.Y it will be
disabled.



> Overall I think it's an interesting feature but I'd prefer to be
> slightly defensive and see how it works in practice before enabling it
> everywhere.
>

Right, and I agree on being defensive, but since this feature still
requires manual
enabling on the brokers before actually being used, I think that gives
enough control
to opt-in or out of this feature as needed.

Thanks for your comments!

Regards,
Magnus



> Thanks,
> Mickael
>
> On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill  wrote:
> >
> > Thanks David for pointing this out,
> > I've updated the KIP to include client_id as a matching selector.
> >
> > Regards,
> > Magnus
> >
> > Den tors 4 nov. 2021 kl 18:01 skrev David Mao  >:
> 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-10 Thread Mickael Maison
Hi Magnus,

I see you've addressed some of the points I raised above but some (4,
5) have not been addressed yet.

I'm really uneasy with this being enabled by default on the client
side. When collecting data, I think the best practice is to ensure
users are explicitly enabling it. You mentioned brokers already have
some (most?) of the information contained in metrics; if so, then why
are we collecting it again? Surely there must be some new information
in the client metrics.

Moreover this is a brand new feature so it's even harder to justify
enabling it and forcing onto all our users. If disabled by default,
it's relatively easy to enable in a new release if we decide to, but
once enabled by default it's much harder to disable. Also this feature
will apply to all future metrics we will add.

Overall I think it's an interesting feature but I'd prefer to be
slightly defensive and see how it works in practice before enabling it
everywhere.

Thanks,
Mickael

On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill  wrote:
>
> Thanks David for pointing this out,
> I've updated the KIP to include client_id as a matching selector.
>
> Regards,
> Magnus
>
> Den tors 4 nov. 2021 kl 18:01 skrev David Mao :
>
> > Hey Magnus,
> >
> > I noticed that the KIP outlines the initial selectors supported as:
> >
> >- client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> >- client_software_name  - client software implementation name.
> >- client_software_version  - client software implementation version.
> >
> > In the given reactive monitoring workflow, we mention that the application
> > user does not know their client's client instance ID, but it's outlined
> > that the operator can add a metrics subscription selecting for clientId. I
> > don't see clientId as one of the supported selectors.
> > I can see how this would have made sense in a previous iteration given that
> > the previous client instance ID proposal was to construct the client
> > instance ID using clientId as a prefix. Now that the client instance ID is
> > a UUID, would we want to add clientId as a supported selector?
> > Let me know what you think.
> >
> > David
> >
> > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill 
> > wrote:
> >
> > > Hi Mickael!
> > >
> > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> > > mickael.mai...@gmail.com
> > > >:
> > >
> > > > Hi Magnus,
> > > >
> > > > Thanks for the proposal.
> > > >
> > > > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > > does a client retrieve this value?
> > > >
> > >
> > > Good catch, it got removed by mistake in one of the edits.
> > >
> > >
> > > >
> > > > 2. In the client API section, you mention a new method
> > > > "clientInstanceId()". Can you clarify which interfaces are affected?
> > > > Is it only Consumer and Producer?
> > > >
> > >
> > > And Admin. Will update the KIP.
> > >
> > >
> > >
> > > > 3. I'm a bit concerned this is enabled by default. Even if the data
> > > > collected is supposed to be not sensitive, I think this can be
> > > > problematic in some environments. Also users don't seem to have the
> > > choice to only expose some metrics. Knowing how much data transits
> > > > through some applications can be considered critical.
> > > >
> > >
> > > The broker already knows how much data transits through the client
> > though,
> > > right?
> > > Care has been taken not to expose information in the standard metrics
> > that
> > > might
> > > reveal sensitive information.
> > >
> > > Do you have an example of how the proposed metrics could leak sensitive
> > > information?
> > > As for limiting which metrics to export: I guess that could make sense
> > > in some
> > > very sensitive use-cases, but those users might disable metrics
> > altogether
> > > for now.
> > > Could these concerns be addressed by a later KIP?
> > >
> > >
> > >
> > > >
> > > > 4. As a user, how do you know if your application is actively sending
> > > > metrics? Are there new metrics exposing what's going on, like how much
> > > > data is being sent?
> > > >
> > >
> > > That's a good question.
> > > Since the proposed metrics interface is not aimed at, or directly
> > available
> > > to, the application
> > > I guess there's little point of adding it here, but instead adding
> > > something to the
> > > existing JMX metrics?
> > > Do you have any suggestions?
> > >
> > >
> > >
> > > > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > > > you have an idea how much throughput this would use?
> > > >
> > >
> > > It depends on the number of partition/topics/etc the client is producing
> > > to/consuming from.
> > > I'll add some sizes to the KIP for some typical use-cases.
> > >
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > > Thanks
> > > >
> > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill 
> > > > wrote:
> > > > >
> > > > > Den tis 19 okt. 2021 kl 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-07 Thread Magnus Edenhill
Thanks David for pointing this out,
I've updated the KIP to include client_id as a matching selector.

Regards,
Magnus

On Thu, 4 Nov 2021 at 18:01, David Mao wrote:

> Hey Magnus,
>
> I noticed that the KIP outlines the initial selectors supported as:
>
>- client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
>- client_software_name  - client software implementation name.
>- client_software_version  - client software implementation version.
>
> In the given reactive monitoring workflow, we mention that the application
> user does not know their client's client instance ID, but it's outlined
> that the operator can add a metrics subscription selecting for clientId. I
> don't see clientId as one of the supported selectors.
> I can see how this would have made sense in a previous iteration given that
> the previous client instance ID proposal was to construct the client
> instance ID using clientId as a prefix. Now that the client instance ID is
> a UUID, would we want to add clientId as a supported selector?
> Let me know what you think.
>
> David
>
> On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill 
> wrote:
>
> > Hi Mickael!
> >
> > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> > mickael.mai...@gmail.com
> > >:
> >
> > > Hi Magnus,
> > >
> > > Thanks for the proposal.
> > >
> > > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > does a client retrieve this value?
> > >
> >
> > Good catch, it got removed by mistake in one of the edits.
> >
> >
> > >
> > > 2. In the client API section, you mention a new method
> > > "clientInstanceId()". Can you clarify which interfaces are affected?
> > > Is it only Consumer and Producer?
> > >
> >
> > And Admin. Will update the KIP.
> >
> >
> >
> > > 3. I'm a bit concerned this is enabled by default. Even if the data
> > > collected is supposed to be not sensitive, I think this can be
> > > problematic in some environments. Also users don't seem to have the
> > > choice to only expose some metrics. Knowing how much data transits
> > > through some applications can be considered critical.
> > >
> >
> > The broker already knows how much data transits through the client
> though,
> > right?
> > Care has been taken not to expose information in the standard metrics
> that
> > might
> > reveal sensitive information.
> >
> > Do you have an example of how the proposed metrics could leak sensitive
> > information?
> > As for limiting which metrics to export: I guess that could make sense
> > in some
> > very sensitive use-cases, but those users might disable metrics
> altogether
> > for now.
> > Could these concerns be addressed by a later KIP?
> >
> >
> >
> > >
> > > 4. As a user, how do you know if your application is actively sending
> > > metrics? Are there new metrics exposing what's going on, like how much
> > > data is being sent?
> > >
> >
> > That's a good question.
> > Since the proposed metrics interface is not aimed at, or directly
> available
> > to, the application
> > I guess there's little point of adding it here, but instead adding
> > something to the
> > existing JMX metrics?
> > Do you have any suggestions?
> >
> >
> >
> > > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > > you have an idea how much throughput this would use?
> > >
> >
> > It depends on the number of partition/topics/etc the client is producing
> > to/consuming from.
> > I'll add some sizes to the KIP for some typical use-cases.
> >
> >
> > Thanks,
> > Magnus
> >
> >
> > > Thanks
> > >
> > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill 
> > > wrote:
> > > >
> > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley  >:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I reviewed the KIP since you called the vote (sorry for not
> reviewing
> > > when
> > > > > you announced your intention to call the vote). I have a few
> > questions
> > > on
> > > > > some of the details.
> > > > >
> > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> > know
> > > > > whether the payload is exposed through this method as compressed or
> > > not.
> > > > > Later on you say "Decompression of the payloads will be handled by
> > the
> > > > > broker metrics plugin, the broker should expose a suitable
> > > decompression
> > > > > API to the metrics plugin for this purpose.", which suggests it's
> the
> > > > > compressed data in the buffer, but then we don't know which codec
> was
> > > used,
> > > > > nor the API via which the plugin should decompress it if required
> for
> > > > > forwarding to the ultimate metrics store. Should the
> > > ClientTelemetryPayload
> > > > > expose a method to get the compression and a decompressor?
> > > > >
> > > >
> > > > Good point, updated.
> > > >
> > > >
> > > >
> > > > > 2. The client-side API is expressed as StringOrError
> > > > > ClientInstance::ClientInstanceId(int timeout_ms). I 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-11-04 Thread David Mao
Hey Magnus,

I noticed that the KIP outlines the initial selectors supported as:

   - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
   - client_software_name  - client software implementation name.
   - client_software_version  - client software implementation version.

In the given reactive monitoring workflow, we mention that the application
user does not know their client's client instance ID, but it's outlined
that the operator can add a metrics subscription selecting for clientId. I
don't see clientId as one of the supported selectors.
I can see how this would have made sense in a previous iteration given that
the previous client instance ID proposal was to construct the client
instance ID using clientId as a prefix. Now that the client instance ID is
a UUID, would we want to add clientId as a supported selector?
Let me know what you think.
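[Editor's note] The selector matching David describes might be sketched as follows — the field names mirror the KIP's selector list plus the proposed clientId, while the exact-match semantics are an assumption for illustration:

```python
def matches(selectors: dict, client: dict) -> bool:
    """A subscription matches a client if every selector it specifies equals
    the corresponding client attribute (an empty selector set matches all)."""
    return all(client.get(key) == value for key, value in selectors.items())

client = {
    "client_instance_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",  # example UUID
    "client_software_name": "apache-kafka-java",
    "client_software_version": "3.1.0",
    "client_id": "orders-producer",  # the proposed additional selector
}

assert matches({}, client)                                         # match-all
assert matches({"client_software_name": "apache-kafka-java"}, client)
assert matches({"client_id": "orders-producer"}, client)           # new selector
assert not matches({"client_id": "payments-producer"}, client)
```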

David

On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill  wrote:

> Hi Mickael!
>
> Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> mickael.mai...@gmail.com
> >:
>
> > Hi Magnus,
> >
> > Thanks for the proposal.
> >
> > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > does a client retrieve this value?
> >
>
> Good catch, it got removed by mistake in one of the edits.
>
>
> >
> > 2. In the client API section, you mention a new method
> > "clientInstanceId()". Can you clarify which interfaces are affected?
> > Is it only Consumer and Producer?
> >
>
> And Admin. Will update the KIP.
>
>
>
> > 3. I'm a bit concerned this is enabled by default. Even if the data
> > collected is supposed to be not sensitive, I think this can be
> > problematic in some environments. Also users don't seem to have the
> > choice to only expose some metrics. Knowing how much data transits
> > through some applications can be considered critical.
> >
>
> The broker already knows how much data transits through the client though,
> right?
> Care has been taken not to expose information in the standard metrics that
> might
> reveal sensitive information.
>
> Do you have an example of how the proposed metrics could leak sensitive
> information?
> As for limiting which metrics to export: I guess that could make sense
> in some
> very sensitive use-cases, but those users might disable metrics altogether
> for now.
> Could these concerns be addressed by a later KIP?
>
>
>
> >
> > 4. As a user, how do you know if your application is actively sending
> > metrics? Are there new metrics exposing what's going on, like how much
> > data is being sent?
> >
>
> That's a good question.
> Since the proposed metrics interface is not aimed at, or directly available
> to, the application
> I guess there's little point of adding it here, but instead adding
> something to the
> existing JMX metrics?
> Do you have any suggestions?
>
>
>
> > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > you have an idea how much throughput this would use?
> >
>
> It depends on the number of partition/topics/etc the client is producing
> to/consuming from.
> I'll add some sizes to the KIP for some typical use-cases.
>
>
> Thanks,
> Magnus
>
>
> > Thanks
> >
> > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill 
> > wrote:
> > >
> > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley :
> > >
> > > > Hi Magnus,
> > > >
> > > > I reviewed the KIP since you called the vote (sorry for not reviewing
> > when
> > > > you announced your intention to call the vote). I have a few
> questions
> > on
> > > > some of the details.
> > > >
> > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> know
> > > > whether the payload is exposed through this method as compressed or
> > not.
> > > > Later on you say "Decompression of the payloads will be handled by
> the
> > > > broker metrics plugin, the broker should expose a suitable
> > decompression
> > > > API to the metrics plugin for this purpose.", which suggests it's the
> > > > compressed data in the buffer, but then we don't know which codec was
> > used,
> > > > nor the API via which the plugin should decompress it if required for
> > > > forwarding to the ultimate metrics store. Should the
> > ClientTelemetryPayload
> > > > expose a method to get the compression and a decompressor?
> > > >
> > >
> > > Good point, updated.
> > >
> > >
> > >
> > > > 2. The client-side API is expressed as StringOrError
> > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> > you're
> > > > thinking about the librdkafka implementation, but it would be good to
> > show
> > > > the API as it would appear on the Apache Kafka clients.
> > > >
> > >
> > > This was meant as pseudo-code, but I changed it to Java.
> > >
> > >
> > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > client to
> > > > send metrics to any broker it is connected to." To be clear, this
> means
> > > > that the client can choose 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-19 Thread Magnus Edenhill
Hi Mickael!

On Tue, 19 Oct 2021 at 19:30, Mickael Maison wrote:

> Hi Magnus,
>
> Thanks for the proposal.
>
> 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> does a client retrieve this value?
>

Good catch, it got removed by mistake in one of the edits.


>
> 2. In the client API section, you mention a new method
> "clientInstanceId()". Can you clarify which interfaces are affected?
> Is it only Consumer and Producer?
>

And Admin. Will update the KIP.
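[Editor's note] For illustration, a toy model of the flow under discussion: the broker assigns the client instance id in its GetTelemetrySubscriptions response, and a `clientInstanceId()`-style accessor on Producer/Consumer/Admin surfaces it to the application once known. Names and behavior here mirror the discussion, not the final Java API (which, for instance, blocks up to a caller-supplied timeout):

```python
import uuid

class Broker:
    def get_telemetry_subscriptions(self, requested_id):
        # The broker assigns a UUID on the first request (client sends no id),
        # and echoes the existing id back on subsequent requests.
        return {"client_instance_id": requested_id or str(uuid.uuid4()),
                "subscribed_metrics": ["client.producer.*"]}

class Client:
    def __init__(self, broker: Broker) -> None:
        self.broker = broker
        self._instance_id = None

    def poll(self) -> None:
        """First telemetry handshake; normally happens in the background."""
        resp = self.broker.get_telemetry_subscriptions(self._instance_id)
        self._instance_id = resp["client_instance_id"]

    def client_instance_id(self):
        """Returns the broker-assigned id, or None before the handshake."""
        return self._instance_id

c = Client(Broker())
assert c.client_instance_id() is None  # not yet connected
c.poll()
assert uuid.UUID(c.client_instance_id())  # now a valid broker-assigned UUID
```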



> 3. I'm a bit concerned this is enabled by default. Even if the data
> collected is supposed to be not sensitive, I think this can be
> problematic in some environments. Also users don't seem to have the
> choice to only expose some metrics. Knowing how much data transits
> through some applications can be considered critical.
>

The broker already knows how much data transits through the client though,
right?
Care has been taken not to expose information in the standard metrics that
might
reveal sensitive information.

Do you have an example of how the proposed metrics could leak sensitive
information?
As for limiting which metrics to export: I guess that could make sense
in some
very sensitive use-cases, but those users might disable metrics altogether
for now.
Could these concerns be addressed by a later KIP?



>
> 4. As a user, how do you know if your application is actively sending
> metrics? Are there new metrics exposing what's going on, like how much
> data is being sent?
>

That's a good question.
Since the proposed metrics interface is not aimed at, or directly available
to, the application
I guess there's little point of adding it here, but instead adding
something to the
existing JMX metrics?
Do you have any suggestions?



> 5. If all metrics are enabled on a regular Consumer or Producer, do
> you have an idea how much throughput this would use?
>

It depends on the number of partition/topics/etc the client is producing
to/consuming from.
I'll add some sizes to the KIP for some typical use-cases.


Thanks,
Magnus


> Thanks
>
> On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill 
> wrote:
> >
> > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley :
> >
> > > Hi Magnus,
> > >
> > > I reviewed the KIP since you called the vote (sorry for not reviewing
> when
> > > you announced your intention to call the vote). I have a few questions
> on
> > > some of the details.
> > >
> > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> > > whether the payload is exposed through this method as compressed or
> not.
> > > Later on you say "Decompression of the payloads will be handled by the
> > > broker metrics plugin, the broker should expose a suitable
> decompression
> > > API to the metrics plugin for this purpose.", which suggests it's the
> > > compressed data in the buffer, but then we don't know which codec was
> used,
> > > nor the API via which the plugin should decompress it if required for
> > > forwarding to the ultimate metrics store. Should the
> ClientTelemetryPayload
> > > expose a method to get the compression and a decompressor?
> > >
> >
> > Good point, updated.
> >
> >
> >
> > > 2. The client-side API is expressed as StringOrError
> > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> you're
> > > thinking about the librdkafka implementation, but it would be good to
> show
> > > the API as it would appear on the Apache Kafka clients.
> > >
> >
> > This was meant as pseudo-code, but I changed it to Java.
> >
> >
> > > 3. "PushTelemetryRequest|Response - protocol request used by the
> client to
> > > send metrics to any broker it is connected to." To be clear, this means
> > > that the client can choose any of the connected brokers and push to
> just
> > > one of them? What should a supporting client do if it gets an error
> when
> > > pushing metrics to a broker, retry sending to the same broker or try
> > > pushing to another broker, or drop the metrics? Should supporting
> clients
> > > send successive requests to a single broker, or round robin, or is
> that up
> > > to the client author? I'm guessing the behaviour should be sticky to
> > > support the rate limiting features, but I think it would be good for
> client
> > > authors if this section were explicit on the recommended behaviour.
> > >
> >
> > You are right, I've updated the KIP to make this clearer.
> >
> >
> > > 4. "Mapping the client instance id to an actual application instance
> > > running on a (virtual) machine can be done by inspecting the metrics
> > > resource labels, such as the client source address and source port, or
> > > security principal, all of which are added by the receiving broker.
> This
> > > will allow the operator together with the user to identify the actual
> > > application instance." Is this really always true? The source IP and
> port
> > > might be a loadbalancer/proxy in some setups. The principal, as 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-19 Thread Mickael Maison
Hi Magnus,

Thanks for the proposal.

1. Looking at the protocol section, isn't "ClientInstanceId" expected
to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
does a client retrieve this value?

2. In the client API section, you mention a new method
"clientInstanceId()". Can you clarify which interfaces are affected?
Is it only Consumer and Producer?

3. I'm a bit concerned this is enabled by default. Even if the data
collected is supposed to be non-sensitive, I think this can be
problematic in some environments. Also, users don't seem to have the
choice to only expose some metrics. Knowing how much data transits
through some applications can be considered critical.

4. As a user, how do you know if your application is actively sending
metrics? Are there new metrics exposing what's going on, like how much
data is being sent?

5. If all metrics are enabled on a regular Consumer or Producer, do
you have an idea how much throughput this would use?

Thanks

On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill  wrote:
>
> Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley :
>
> > Hi Magnus,
> >
> > I reviewed the KIP since you called the vote (sorry for not reviewing when
> > you announced your intention to call the vote). I have a few questions on
> > some of the details.
> >
> > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> > whether the payload is exposed through this method as compressed or not.
> > Later on you say "Decompression of the payloads will be handled by the
> > broker metrics plugin, the broker should expose a suitable decompression
> > API to the metrics plugin for this purpose.", which suggests it's the
> > compressed data in the buffer, but then we don't know which codec was used,
> > nor the API via which the plugin should decompress it if required for
> > forwarding to the ultimate metrics store. Should the ClientTelemetryPayload
> > expose a method to get the compression and a decompressor?
> >
>
> Good point, updated.
>
>
>
> > 2. The client-side API is expressed as StringOrError
> > ClientInstance::ClientInstanceId(int timeout_ms). I understand that you're
> > thinking about the librdkafka implementation, but it would be good to show
> > the API as it would appear on the Apache Kafka clients.
> >
>
> This was meant as pseudo-code, but I changed it to Java.
>
>
> > 3. "PushTelemetryRequest|Response - protocol request used by the client to
> > send metrics to any broker it is connected to." To be clear, this means
> > that the client can choose any of the connected brokers and push to just
> > one of them? What should a supporting client do if it gets an error when
> > pushing metrics to a broker, retry sending to the same broker or try
> > pushing to another broker, or drop the metrics? Should supporting clients
> > send successive requests to a single broker, or round robin, or is that up
> > to the client author? I'm guessing the behaviour should be sticky to
> > support the rate limiting features, but I think it would be good for client
> > authors if this section were explicit on the recommended behaviour.
> >
>
> You are right, I've updated the KIP to make this clearer.
>
>
> > 4. "Mapping the client instance id to an actual application instance
> > running on a (virtual) machine can be done by inspecting the metrics
> > resource labels, such as the client source address and source port, or
> > security principal, all of which are added by the receiving broker. This
> > will allow the operator together with the user to identify the actual
> > application instance." Is this really always true? The source IP and port
> > might be a loadbalancer/proxy in some setups. The principal, as already
> > mentioned in the KIP, might be shared between multiple applications. So at
> > worst the organization running the clients might have to consult the logs
> > of a set of client applications, right?
> >
>
> Yes, that's correct. There's no guaranteed mapping from client_instance_id
> to
> an actual instance, that's why the KIP recommends client implementations to
> log the client instance id
> upon retrieval, and also provide an API for the application to retrieve the
> instance id programmatically
> if it has a better way of exposing it.
>
>
> 5. "Tests indicate that a compression ratio up to 10x is possible for the
> > standard metrics." Client authors might appreciate your mentioning which
> > compression codec got these results.
> >
>
> Good point. Updated.
>
>
> > 6. "Should the client send a push request prior to expiry of the previously
> > calculated PushIntervalMs the broker will discard the metrics and return a
> > PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
> > RATE_LIMITED a new error code? It's not mentioned in the "New Error Codes"
> > section.
> >
>
> That's a leftover, it should be using the standard ThrottleTime mechanism.
> Fixed.
>
>
> > 7. In the section "Standard client resource labels" application_id is
> > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-19 Thread Magnus Edenhill
On Tue, 19 Oct 2021 at 13:22, Tom Bentley wrote:

> Hi Magnus,
>
> I reviewed the KIP since you called the vote (sorry for not reviewing when
> you announced your intention to call the vote). I have a few questions on
> some of the details.
>
> 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> whether the payload is exposed through this method as compressed or not.
> Later on you say "Decompression of the payloads will be handled by the
> broker metrics plugin, the broker should expose a suitable decompression
> API to the metrics plugin for this purpose.", which suggests it's the
> compressed data in the buffer, but then we don't know which codec was used,
> nor the API via which the plugin should decompress it if required for
> forwarding to the ultimate metrics store. Should the ClientTelemetryPayload
> expose a method to get the compression and a decompressor?
>

Good point, updated.



> 2. The client-side API is expressed as StringOrError
> ClientInstance::ClientInstanceId(int timeout_ms). I understand that you're
> thinking about the librdkafka implementation, but it would be good to show
> the API as it would appear on the Apache Kafka clients.
>

This was meant as pseudo-code, but I changed it to Java.


> 3. "PushTelemetryRequest|Response - protocol request used by the client to
> send metrics to any broker it is connected to." To be clear, this means
> that the client can choose any of the connected brokers and push to just
> one of them? What should a supporting client do if it gets an error when
> pushing metrics to a broker, retry sending to the same broker or try
> pushing to another broker, or drop the metrics? Should supporting clients
> send successive requests to a single broker, or round robin, or is that up
> to the client author? I'm guessing the behaviour should be sticky to
> support the rate limiting features, but I think it would be good for client
> authors if this section were explicit on the recommended behaviour.
>

You are right, I've updated the KIP to make this clearer.


> 4. "Mapping the client instance id to an actual application instance
> running on a (virtual) machine can be done by inspecting the metrics
> resource labels, such as the client source address and source port, or
> security principal, all of which are added by the receiving broker. This
> will allow the operator together with the user to identify the actual
> application instance." Is this really always true? The source IP and port
> might be a loadbalancer/proxy in some setups. The principal, as already
> mentioned in the KIP, might be shared between multiple applications. So at
> worst the organization running the clients might have to consult the logs
> of a set of client applications, right?
>

Yes, that's correct. There's no guaranteed mapping from client_instance_id
to an actual instance; that's why the KIP recommends that client
implementations log the client instance id upon retrieval, and also provide
an API for the application to retrieve the instance id programmatically if
it has a better way of exposing it.
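In other words, correlation is a join on the client instance id: the receiving broker labels each push with the id (plus source address, principal, etc.), and the application logs the same id on retrieval. A minimal sketch of that join, with entirely hypothetical record layouts (these field names are not defined by the KIP):

```python
# Hypothetical record shapes: broker-side telemetry labels and
# application-side log entries that both carry the client instance id.
broker_labels = [
    {"client_instance_id": "b3af1d-01", "source_ip": "10.0.0.5",
     "principal": "User:orders-app"},
]
app_logs = [
    {"client_instance_id": "b3af1d-01", "host": "orders-7f9c", "pid": 4242},
]

def correlate(broker_labels, app_logs):
    """Map each broker-observed client instance id to the application that logged it."""
    by_id = {rec["client_instance_id"]: rec for rec in app_logs}
    return {lbl["client_instance_id"]: by_id.get(lbl["client_instance_id"])
            for lbl in broker_labels}

matches = correlate(broker_labels, app_logs)
print(matches["b3af1d-01"]["host"])  # -> orders-7f9c
```

When the id is absent from the application logs (e.g. behind a proxy, or with a shared principal), the join simply yields no match, which is exactly the "consult the logs of a set of client applications" fallback described above.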


5. "Tests indicate that a compression ratio up to 10x is possible for the
> standard metrics." Client authors might appreciate your mentioning which
> compression codec got these results.
>

Good point. Updated.


> 6. "Should the client send a push request prior to expiry of the previously
> calculated PushIntervalMs the broker will discard the metrics and return a
> PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
> RATE_LIMITED a new error code? It's not mentioned in the "New Error Codes"
> section.
>

That's a leftover, it should be using the standard ThrottleTime mechanism.
Fixed.


> 7. In the section "Standard client resource labels" application_id is
> described as Kafka Streams only, but the section of "Client Identification"
> talks about "application instance id as an optional future nice-to-have
> that may be included as a metrics label if it has been set by the user", so
> I'm confused whether non-Kafka Streams clients should set an application_id
> or not.
>

I'll clarify this in the KIP, but basically we would need to add an
`application.id` config property for non-Streams clients for this purpose,
and that's outside the scope of this KIP since we want to keep it
zero-conf-ish on the client side.


>
> Kind regards,
>
> Tom
>

Thanks for the review,
Magnus



>
> On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill  wrote:
>
> > Hi all,
> >
> > I've updated the KIP following our recent discussions on the mailing
> list:
> >  - split the protocol in two, one for getting the metrics subscriptions,
> > and one for pushing the metrics.
> >  - simplifications: initially only one supported metrics format, no
> > client.id in the instance id, etc.
> >  - made CLIENT_METRICS subscription configuration entries more structured
> >and allowing better client matching selectors (not only on the
> instance
> > id, but also the other
> >

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-19 Thread Tom Bentley
Hi Magnus,

I reviewed the KIP since you called the vote (sorry for not reviewing when
you announced your intention to call the vote). I have a few questions on
some of the details.

1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
whether the payload is exposed through this method as compressed or not.
Later on you say "Decompression of the payloads will be handled by the
broker metrics plugin, the broker should expose a suitable decompression
API to the metrics plugin for this purpose.", which suggests it's the
compressed data in the buffer, but then we don't know which codec was used,
nor the API via which the plugin should decompress it if required for
forwarding to the ultimate metrics store. Should the ClientTelemetryPayload
expose a method to get the compression and a decompressor?
2. The client-side API is expressed as StringOrError
ClientInstance::ClientInstanceId(int timeout_ms). I understand that you're
thinking about the librdkafka implementation, but it would be good to show
the API as it would appear on the Apache Kafka clients.
3. "PushTelemetryRequest|Response - protocol request used by the client to
send metrics to any broker it is connected to." To be clear, this means
that the client can choose any of the connected brokers and push to just
one of them? What should a supporting client do if it gets an error when
pushing metrics to a broker, retry sending to the same broker or try
pushing to another broker, or drop the metrics? Should supporting clients
send successive requests to a single broker, or round robin, or is that up
to the client author? I'm guessing the behaviour should be sticky to
support the rate limiting features, but I think it would be good for client
authors if this section were explicit on the recommended behaviour.
4. "Mapping the client instance id to an actual application instance
running on a (virtual) machine can be done by inspecting the metrics
resource labels, such as the client source address and source port, or
security principal, all of which are added by the receiving broker. This
will allow the operator together with the user to identify the actual
application instance." Is this really always true? The source IP and port
might be a loadbalancer/proxy in some setups. The principal, as already
mentioned in the KIP, might be shared between multiple applications. So at
worst the organization running the clients might have to consult the logs
of a set of client applications, right?
5. "Tests indicate that a compression ratio up to 10x is possible for the
standard metrics." Client authors might appreciate your mentioning which
compression codec got these results.
6. "Should the client send a push request prior to expiry of the previously
calculated PushIntervalMs the broker will discard the metrics and return a
PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
RATE_LIMITED a new error code? It's not mentioned in the "New Error Codes"
section.
7. In the section "Standard client resource labels" application_id is
described as Kafka Streams only, but the section on "Client Identification"
talks about "application instance id as an optional future nice-to-have
that may be included as a metrics label if it has been set by the user", so
I'm confused whether non-Kafka Streams clients should set an application_id
or not.

Kind regards,

Tom

On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill  wrote:

> Hi all,
>
> I've updated the KIP following our recent discussions on the mailing list:
>  - split the protocol in two, one for getting the metrics subscriptions,
> and one for pushing the metrics.
>  - simplifications: initially only one supported metrics format, no
> client.id in the instance id, etc.
>  - made CLIENT_METRICS subscription configuration entries more structured
>and allowing better client matching selectors (not only on the instance
> id, but also the other
>client resource labels, such as client_software_name, etc.).
>
> Unless there are further comments I'll call the vote in a day or two.
>
> Regards,
> Magnus
>
>
>
> Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill :
>
> > Hi Gwen,
> >
> > I'm finishing up the KIP based on the last couple of discussion points in
> > this thread
> > and will call the Vote later this week.
> >
> > Best,
> > Magnus
> >
> > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
>  > >:
> >
> >> Hey,
> >>
> >> I noticed that there was no discussion for the last 10 days, but I
> >> couldn't
> >> find the vote thread. Is there one that I'm missing?
> >>
> >> Gwen
> >>
> >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill 
> >> wrote:
> >>
> >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe  >:
> >> >
> >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> >> > > > Thanks Magnus & Colin for the discussion.
> >> > > >
> >> > > > Based on KIP-714's stateless design, Client can pretty much use
> any
> >> > > > connection to any broker to send metrics. We are not associating
> >> > > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-07 Thread Magnus Edenhill
Hi all,

I've updated the KIP following our recent discussions on the mailing list:
 - split the protocol in two, one for getting the metrics subscriptions,
and one for pushing the metrics.
 - simplifications: initially only one supported metrics format, no
client.id in the instance id, etc.
 - made CLIENT_METRICS subscription configuration entries more structured
   and allowing better client matching selectors (not only on the instance
id, but also the other
   client resource labels, such as client_software_name, etc.).

Unless there are further comments I'll call the vote in a day or two.

Regards,
Magnus



On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill wrote:

> Hi Gwen,
>
> I'm finishing up the KIP based on the last couple of discussion points in
> this thread
> and will call the Vote later this week.
>
> Best,
> Magnus
>
> Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira  >:
>
>> Hey,
>>
>> I noticed that there was no discussion for the last 10 days, but I
>> couldn't
>> find the vote thread. Is there one that I'm missing?
>>
>> Gwen
>>
>> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill 
>> wrote:
>>
>> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe :
>> >
>> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
>> > > > Thanks Magnus & Colin for the discussion.
>> > > >
>> > > > Based on KIP-714's stateless design, Client can pretty much use any
>> > > > connection to any broker to send metrics. We are not associating
>> > > connection
>> > > > with client metric state. Is my understanding correct? If yes,  how
>> > about
>> > > > the following two scenarios
>> > > >
>> > > > 1) One Client (Client-ID) registers two different client instance id
>> > via
>> > > > separate registration. Is it permitted? If OK, how to distinguish
>> them
>> > > from
>> > > > the case 2 below.
>> > > >
>> > >
>> > > Hi Feng,
>> > >
>> > > My understanding, which Magnus can clarify I guess, is that you could
>> > have
>> > > something like two Producer instances running with the same client.id
>> > > (perhaps because they're using the same config file, for example).
>> They
>> > > could even be in the same process. But they would get separate UUIDs.
>> > >
>> > > I believe Magnus used the term client to mean "Producer or Consumer".
>> So
>> > > if you have both a Producer and a Consumer in your application I would
>> > > expect you'd get separate UUIDs for both. Again Magnus can chime in
>> > here, I
>> > > guess.
>> > >
>> >
>> > That's correct.
>> >
>> >
>> > >
>> > > > 2) How about the client restarting? What's the expectation? Should
>> the
>> > > > server expect the client to carry a persisted client instance id or
>> > > should
>> > > > the client be treated as a new instance?
>> > >
>> > > The KIP doesn't describe any mechanism for persistence, so I would
>> assume
>> > > that when you restart the client you get a new UUID. I agree that it
>> > would
>> > > be good to spell this out.
>> > >
>> > >
>> > Right, it will not be persisted since a client instance can't be
>> restarted.
>> >
>> > Will update the KIP to make this clearer.
>> >
>> > /Magnus
>> >
>>
>>
>> --
>> Gwen Shapira
>> Engineering Manager | Confluent
>> 650.450.2760 | @gwenshap
>> Follow us: Twitter | blog
>>
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-04 Thread Magnus Edenhill
Hi Gwen,

I'm finishing up the KIP based on the last couple of discussion points in
this thread
and will call the Vote later this week.

Best,
Magnus

On Sat, 2 Oct 2021 at 02:01, Gwen Shapira wrote:

> Hey,
>
> I noticed that there was no discussion for the last 10 days, but I couldn't
> find the vote thread. Is there one that I'm missing?
>
> Gwen
>
> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill 
> wrote:
>
> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe :
> >
> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > Thanks Magnus & Colin for the discussion.
> > > >
> > > > Based on KIP-714's stateless design, Client can pretty much use any
> > > > connection to any broker to send metrics. We are not associating
> > > connection
> > > > with client metric state. Is my understanding correct? If yes,  how
> > about
> > > > the following two scenarios
> > > >
> > > > 1) One Client (Client-ID) registers two different client instance id
> > via
> > > > separate registration. Is it permitted? If OK, how to distinguish
> them
> > > from
> > > > the case 2 below.
> > > >
> > >
> > > Hi Feng,
> > >
> > > My understanding, which Magnus can clarify I guess, is that you could
> > have
> > > something like two Producer instances running with the same client.id
> > > (perhaps because they're using the same config file, for example). They
> > > could even be in the same process. But they would get separate UUIDs.
> > >
> > > I believe Magnus used the term client to mean "Producer or Consumer".
> So
> > > if you have both a Producer and a Consumer in your application I would
> > > expect you'd get separate UUIDs for both. Again Magnus can chime in
> > here, I
> > > guess.
> > >
> >
> > That's correct.
> >
> >
> > >
> > > > 2) How about the client restarting? What's the expectation? Should
> the
> > > > server expect the client to carry a persisted client instance id or
> > > should
> > > > the client be treated as a new instance?
> > >
> > > The KIP doesn't describe any mechanism for persistence, so I would
> assume
> > > that when you restart the client you get a new UUID. I agree that it
> > would
> > > be good to spell this out.
> > >
> > >
> > Right, it will not be persisted since a client instance can't be
> restarted.
> >
> > Will update the KIP to make this clearer.
> >
> > /Magnus
> >
>
>
> --
> Gwen Shapira
> Engineering Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-10-01 Thread Gwen Shapira
Hey,

I noticed that there was no discussion for the last 10 days, but I couldn't
find the vote thread. Is there one that I'm missing?

Gwen

On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill  wrote:

> Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe :
>
> > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > Thanks Magnus & Colin for the discussion.
> > >
> > > Based on KIP-714's stateless design, Client can pretty much use any
> > > connection to any broker to send metrics. We are not associating
> > connection
> > > with client metric state. Is my understanding correct? If yes,  how
> about
> > > the following two scenarios
> > >
> > > 1) One Client (Client-ID) registers two different client instance id
> via
> > > separate registration. Is it permitted? If OK, how to distinguish them
> > from
> > > the case 2 below.
> > >
> >
> > Hi Feng,
> >
> > My understanding, which Magnus can clarify I guess, is that you could
> have
> > something like two Producer instances running with the same client.id
> > (perhaps because they're using the same config file, for example). They
> > could even be in the same process. But they would get separate UUIDs.
> >
> > I believe Magnus used the term client to mean "Producer or Consumer". So
> > if you have both a Producer and a Consumer in your application I would
> > expect you'd get separate UUIDs for both. Again Magnus can chime in
> here, I
> > guess.
> >
>
> That's correct.
>
>
> >
> > > 2) How about the client restarting? What's the expectation? Should the
> > > server expect the client to carry a persisted client instance id or
> > should
> > > the client be treated as a new instance?
> >
> > The KIP doesn't describe any mechanism for persistence, so I would assume
> > that when you restart the client you get a new UUID. I agree that it
> would
> > be good to spell this out.
> >
> >
> Right, it will not be persisted since a client instance can't be restarted.
>
> Will update the KIP to make this clearer.
>
> /Magnus
>


-- 
Gwen Shapira
Engineering Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-22 Thread Magnus Edenhill
On Tue, 21 Sep 2021 at 06:58, Colin McCabe wrote:

> On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > Thanks Magnus & Colin for the discussion.
> >
> > Based on KIP-714's stateless design, Client can pretty much use any
> > connection to any broker to send metrics. We are not associating
> connection
> > with client metric state. Is my understanding correct? If yes,  how about
> > the following two scenarios
> >
> > 1) One Client (Client-ID) registers two different client instance id via
> > separate registration. Is it permitted? If OK, how to distinguish them
> from
> > the case 2 below.
> >
>
> Hi Feng,
>
> My understanding, which Magnus can clarify I guess, is that you could have
> something like two Producer instances running with the same client.id
> (perhaps because they're using the same config file, for example). They
> could even be in the same process. But they would get separate UUIDs.
>
> I believe Magnus used the term client to mean "Producer or Consumer". So
> if you have both a Producer and a Consumer in your application I would
> expect you'd get separate UUIDs for both. Again Magnus can chime in here, I
> guess.
>

That's correct.


>
> > 2) How about the client restarting? What's the expectation? Should the
> > server expect the client to carry a persisted client instance id or
> should
> > the client be treated as a new instance?
>
> The KIP doesn't describe any mechanism for persistence, so I would assume
> that when you restart the client you get a new UUID. I agree that it would
> be good to spell this out.
>
>
Right, it will not be persisted since a client instance can't be restarted.

Will update the KIP to make this clearer.

/Magnus


Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-22 Thread Magnus Edenhill
On Mon, 20 Sep 2021 at 20:41, Colin McCabe wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>

I'm not sure how it relates to ApiVersion?
The SubscriptionId is a rather simple and stateless way to make sure the
client is using the most recently configured metrics subscriptions.


>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.


While the standard metrics subscription is rarely updated, the second
use-case of troubleshooting a specific client will require the metrics
subscription to be updated and propagated in a timely manner.
Closing all client connections on the broker side is quite an intrusive
thing to do, would create a thundering herd of reconnects, and doesn't
really serve much of a purpose, since the metrics in this proposal are
explicitly not bound to a single broker connection but to a client instance,
allowing any broker connection to be used. This is a feature of the
proposal.



> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>

Registering a client is perhaps a good idea, but it is a bigger undertaking
that also involves other parts of the protocol (we wouldn't need to send the
ClientId in the protocol header, for instance), and is thus outside the
scope of this proposal. What we are aiming to provide here is transmission
of client metrics that is as stateless as possible, through any broker.


>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>

I believe clientInstanceId is descriptive of what it is; the identification
of a specific client instance or incarnation.
There's some prior art here, e.g., group.id vs group.instance.id (in that
case it made sense to make the instance id configurable, but we want
metrics to be zero-conf).

+1 on using the UUID type as opposed to string.


>
> > 6. > PushTelemetryRequest{
> >ClientInstanceId = f00d-feed-deff-ceff--….,
> >SubscriptionId = 0x234adf34,
> >ContentType = OTLPv8|ZSTD,
> >Terminating = False,
> >Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>

The overhead of resending the client instance id (16 bytes) is minimal in
relation to the metrics data itself, and it is typically only sent every
60s or so.

As for caching it on the connection; the client only requests the
clientInstanceId once per client instance lifetime, not per broker
connection, so the client does not need to re-register itself on each
connection.

The SubscriptionId is used to make sure the client's metrics subscription
is up to date; it is pretty much a configuration version, but without the
need for sequentiality checks, since a simple inequality check by the
broker is sufficient.



> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-21 Thread Feng Min
Hi Colin,

It was just an analogy to say that the API version is similar to the
subscription id. Every request comes with API version information, and the
broker can return an error if the supported API version has changed. That
is similar to the role of the SubscriptionId here.

Thanks,
Feng



On Mon, Sep 20, 2021 at 9:51 PM Colin McCabe  wrote:

> On Mon, Sep 20, 2021, at 12:30, Feng Min wrote:
> > Some comments about subscriptionId.
> >
> > ApiVersion is not a good example. API Version here is actually acting
> like
> > an identifier as the client will carry this information. Forcing to
> > disconnect a connection from the server side is quite heavy. IMHO, the
> > behavior is kind of part of the protocol. Adding subscriptionId is
> > relatively simple and straightforward.
> >
>
> Hi Feng,
>
> Sorry, I'm not sure what you mean by "API Version here is actually acting
> like an identifier." APIVersions is not an identifier. Each client gets the
> same ApiVersionsResponse from the broker. In most clusters, each broker
> will return the same set of ApiVersionsResponse as well. So you can not use
> ApiVersionsResponse as an identifier of anything, as far as I can see.
>
> Dropping a connection is not that "heavy" considering that it only has to
> happen when we change the metrics subscription, which should be a very rare
> event, if I understand the proposal correctly.
>
> best,
> Colin
>
>
> >
> >
> >> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> >> complicated machinery for changing ApiVersions, and that is something
> that
> >> can also change over time, and which affects the clients.
> >>
> >> Changing the configured metrics should be extremely rare. In this case,
> >> why don't we just close all connections on the broker side? Then the
> >> clients can re-connect and re-fetch the information about the metrics
> >> they're supposed to send.
> >>
> >> >
> >> > Something like this:
> >> >
> >> > // Get the configured metrics subscription.
> >> > GetTelemetrySubscriptionsRequest {
> >> >StrNull  ClientInstanceId  // Null on first invocation to retrieve
> a
> >> > newly generated instance id from the broker.
> >> > }
> >>
> >> It seems like the goal here is to have the client register itself, so
> that
> >> we can tell if this is an old client reconnecting. If that is the case,
> >> then I suggest to rename the RPC to RegisterClient.
> >>
> >> I think we need a better name than "clientInstanceId" since that name is
> >> very similar to "clientId." Perhaps something like originId? Or
> clientUuid?
> >> Let's also use UUID here rather than a string.
> >>
> >> > 6. > PushTelemetryRequest{
> >> >ClientInstanceId = f00d-feed-deff-ceff--….,
> >> >SubscriptionId = 0x234adf34,
> >> >ContentType = OTLPv8|ZSTD,
> >> >Terminating = False,
> >> >Metrics = …// zstd-compressed OTLPv08-protobuf-serialized
> metrics
> >> >   }
> >>
> >> It's not necessary for the client to re-send its client instance ID
> here,
> >> since it already registered with RegisterClient. If the TCP connection
> >> dropped, it will have to re-send RegisterClient anyway. SubscriptionID
> we
> >> should get rid of, as I said above.
> >>
> >> I don't see the need for protobufs. Why not just use Kafka's own
> >> serialization mechanism? As much as possible, we should try to avoid
> >> creating "turduckens" of protocol X containing a buffer serialized with
> >> protocol Y, containing a protocol serialized with protocol Z. These
> aren't
> >> conducive to a good implementation, and make it harder for people to
> write
> >> clients. Just use Kafka's RPC protocol (with optional fields if you
> wish).
> >>
> >> If we do compression on Kafka RPC, I would prefer that we do it a more
> >> generic way that applies to all control messages, not just this one. I
> also
> >> doubt we need to support lots and lots of different compression codecs,
> at
> >> first at least.
> >>
> >> Another thing I'd like to understand is whether we truly need
> >> "terminating" (or anything like it). I'm still confused about how the
> >> backend could use this. Keep in mind that we may receive it on multiple
> >> brokers (or not receive it at all). We may receive more stuff about
> client
> >> XYZ from broker 1 after we have already received a "terminated" for
> client
> >> XYZ from broker 2.
> >>
> >> > If the broker connection goes down or the connection is to be used for
> >> > other purposes (e.g., blocking FetchRequests), the client will send
> >> > PushTelemetryRequests to any other broker in the cluster, using the
> same
> >> > ClientInstanceId and SubscriptionId as received in the latest
> >> > GetTelemetrySubscriptionsResponse.
> >> >
> >> > While the subscriptionId may change during the lifetime of the client
> >> > instance (when metric subscriptions are updated), the
> ClientInstanceId is
> >> > only acquired once and must not change (as it is used to identify the
> >> > unique client instance).
> >> > ...
> >> > What we do 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-20 Thread Colin McCabe
On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> Thanks Magnus & Colin for the discussion.
>
> Based on KIP-714's stateless design, the client can pretty much use any
> connection to any broker to send metrics. We are not associating a connection
> with client metric state. Is my understanding correct? If yes, how about
> the following two scenarios:
>
> 1) One Client (Client-ID) registers two different client instance ids via
> separate registrations. Is that permitted? If so, how do we distinguish them
> from case 2 below?
>

Hi Feng,

My understanding, which Magnus can clarify I guess, is that you could have 
something like two Producer instances running with the same client.id (perhaps 
because they're using the same config file, for example). They could even be in 
the same process. But they would get separate UUIDs.

I believe Magnus used the term client to mean "Producer or Consumer". So if you 
have both a Producer and a Consumer in your application I would expect you'd 
get separate UUIDs for both. Again Magnus can chime in here, I guess.
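To illustrate (a toy sketch with hypothetical names, not the KIP's actual RPC handling): two instances that share a client.id still receive distinct instance ids when they register.

```python
import uuid

class Broker:
    """Toy model of broker-side instance-id assignment."""

    def __init__(self):
        self.registered = {}  # client instance id -> client.id label

    def register_client(self, client_id, client_instance_id=None):
        # A null instance id means "new instance": the broker generates one.
        if client_instance_id is None:
            client_instance_id = uuid.uuid4()
        self.registered[client_instance_id] = client_id
        return client_instance_id

broker = Broker()

# Two producers sharing the same client.id (e.g., the same config file)...
a = broker.register_client("orders-producer")
b = broker.register_client("orders-producer")

# ...are still distinguishable by their instance ids.
assert a != b
assert broker.registered[a] == broker.registered[b] == "orders-producer"

# A reconnecting instance presents its existing id and keeps it.
assert broker.register_client("orders-producer", a) == a
```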

> 2) How about the client restarting? What's the expectation? Should the
> server expect the client to carry a persisted client instance id or should
> the client be treated as a new instance?

The KIP doesn't describe any mechanism for persistence, so I would assume that 
when you restart the client you get a new UUID. I agree that it would be good 
to spell this out.

> also some comments inline.
>
> On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe  wrote:
>
> ...
>
>> It seems like the goal here is to have the client register itself, so that
>> we can tell if this is an old client reconnecting. If that is the case,
>> then I suggest to rename the RPC to RegisterClient.
>>
>> I think we need a better name than "clientInstanceId" since that name is
>> very similar to "clientId." Perhaps something like originId? Or clientUuid?
>> Let's also use UUID here rather than a string.
>>
>> > 6. > PushTelemetryRequest{
>> >ClientInstanceId = f00d-feed-deff-ceff--….,
>> >SubscriptionId = 0x234adf34,
>> >ContentType = OTLPv8|ZSTD,
>> >Terminating = False,
>> >Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>> >   }
>>
>>
> If we assume connection is not bound to ClientInstanceId, and the RPC can
> be sent to any broker (not necessarily the broker doing the registration).
> The client-instance-id is required for every metric reporting. It's just
> part of the labelling.
>

Hmm, I don't quite follow. I suggested using a name that was less confusingly 
similar to "client ID". Your response states that "the client-instance-id is 
required for every metric reporting... it's just part of the labelling". I 
don't see how the UUID being required for metric reporting is related to what 
its name should be. Did you mean to reply to a different point here?

best,
Colin


>
>> It's not necessary for the client to re-send its client instance ID here,
>> since it already registered with RegisterClient. If the TCP connection
>> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
>> should get rid of, as I said above.
>>
>> I don't see the need for protobufs. Why not just use Kafka's own
>> serialization mechanism? As much as possible, we should try to avoid
>> creating "turduckens" of protocol X containing a buffer serialized with
>> protocol Y, containing a protocol serialized with protocol Z. These aren't
>> conducive to a good implementation, and make it harder for people to write
>> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>>
>> If we do compression on Kafka RPC, I would prefer that we do it a more
>> generic way that applies to all control messages, not just this one. I also
>> doubt we need to support lots and lots of different compression codecs, at
>> first at least.
>>
>> Another thing I'd like to understand is whether we truly need
>> "terminating" (or anything like it). I'm still confused about how the
>> backend could use this. Keep in mind that we may receive it on multiple
>> brokers (or not receive it at all). We may receive more stuff about client
>> XYZ from broker 1 after we have already received a "terminated" for client
>> XYZ from broker 2.
>>
>> > If the broker connection goes down or the connection is to be used for
>> > other purposes (e.g., blocking FetchRequests), the client will send
>> > PushTelemetryRequests to any other broker in the cluster, using the same
>> > ClientInstanceId and SubscriptionId as received in the latest
>> > GetTelemetrySubscriptionsResponse.
>> >
>> > While the subscriptionId may change during the lifetime of the client
>> > instance (when metric subscriptions are updated), the ClientInstanceId is
>> > only acquired once and must not change (as it is used to identify the
>> > unique client instance).
>> > ...
>> > What we do want though is ability to single out a specific client
>> instance
>> > to give it a more fine-grained 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-20 Thread Colin McCabe
On Mon, Sep 20, 2021, at 12:30, Feng Min wrote:
> Some comments about subscriptionId.
>
> ApiVersion is not a good example. API Version here is actually acting like
> an identifier as the client will carry this information. Forcing to
> disconnect a connection from the server side is quite heavy. IMHO, the
> behavior is kind of part of the protocol. Adding subscriptionId is
> relatively simple and straightforward.
>

Hi Feng,

Sorry, I'm not sure what you mean by "API Version here is actually acting like 
an identifier." APIVersions is not an identifier. Each client gets the same 
ApiVersionsResponse from the broker. In most clusters, each broker will return 
the same set of ApiVersionsResponse as well. So you can not use 
ApiVersionsResponse as an identifier of anything, as far as I can see.

Dropping a connection is not that "heavy" considering that it only has to 
happen when we change the metrics subscription, which should be a very rare 
event, if I understand the proposal correctly.

best,
Colin


>
>
>> Hmm, SubscriptionId seems rather complex. We don't have this kind of
>> complicated machinery for changing ApiVersions, and that is something that
>> can also change over time, and which affects the clients.
>>
>> Changing the configured metrics should be extremely rare. In this case,
>> why don't we just close all connections on the broker side? Then the
>> clients can re-connect and re-fetch the information about the metrics
>> they're supposed to send.
>>
>> >
>> > Something like this:
>> >
>> > // Get the configured metrics subscription.
>> > GetTelemetrySubscriptionsRequest {
>> >StrNull  ClientInstanceId  // Null on first invocation to retrieve a
>> > newly generated instance id from the broker.
>> > }
>>
>> It seems like the goal here is to have the client register itself, so that
>> we can tell if this is an old client reconnecting. If that is the case,
>> then I suggest to rename the RPC to RegisterClient.
>>
>> I think we need a better name than "clientInstanceId" since that name is
>> very similar to "clientId." Perhaps something like originId? Or clientUuid?
>> Let's also use UUID here rather than a string.
>>
>> > 6. > PushTelemetryRequest{
>> >ClientInstanceId = f00d-feed-deff-ceff--….,
>> >SubscriptionId = 0x234adf34,
>> >ContentType = OTLPv8|ZSTD,
>> >Terminating = False,
>> >Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>> >   }
>>
>> It's not necessary for the client to re-send its client instance ID here,
>> since it already registered with RegisterClient. If the TCP connection
>> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
>> should get rid of, as I said above.
>>
>> I don't see the need for protobufs. Why not just use Kafka's own
>> serialization mechanism? As much as possible, we should try to avoid
>> creating "turduckens" of protocol X containing a buffer serialized with
>> protocol Y, containing a protocol serialized with protocol Z. These aren't
>> conducive to a good implementation, and make it harder for people to write
>> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>>
>> If we do compression on Kafka RPC, I would prefer that we do it a more
>> generic way that applies to all control messages, not just this one. I also
>> doubt we need to support lots and lots of different compression codecs, at
>> first at least.
>>
>> Another thing I'd like to understand is whether we truly need
>> "terminating" (or anything like it). I'm still confused about how the
>> backend could use this. Keep in mind that we may receive it on multiple
>> brokers (or not receive it at all). We may receive more stuff about client
>> XYZ from broker 1 after we have already received a "terminated" for client
>> XYZ from broker 2.
>>
>> > If the broker connection goes down or the connection is to be used for
>> > other purposes (e.g., blocking FetchRequests), the client will send
>> > PushTelemetryRequests to any other broker in the cluster, using the same
>> > ClientInstanceId and SubscriptionId as received in the latest
>> > GetTelemetrySubscriptionsResponse.
>> >
>> > While the subscriptionId may change during the lifetime of the client
>> > instance (when metric subscriptions are updated), the ClientInstanceId is
>> > only acquired once and must not change (as it is used to identify the
>> > unique client instance).
>> > ...
>> > What we do want though is ability to single out a specific client
>> instance
>> > to give it a more fine-grained subscription for troubleshooting, and
>> > we can do that with the current proposal with matching solely on the
>> > CLIENT_INSTANCE_ID.
>> > In other words; all clients will have the same standard metrics
>> > subscription, but specific client instances can have alternate
>> > subscriptions.
>>
>> That makes sense, and gives a good reason why we might want to couple
>> finding the metrics info to passing the client UUID.
>>
>> > The 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-20 Thread Feng Min
Thanks Magnus & Colin for the discussion.

Based on KIP-714's stateless design, the client can pretty much use any
connection to any broker to send metrics. We are not associating a connection
with client metric state. Is my understanding correct? If yes, how about
the following two scenarios:

1) One Client (Client-ID) registers two different client instance ids via
separate registrations. Is that permitted? If so, how do we distinguish them
from case 2 below?

2) How about the client restarting? What's the expectation? Should the
server expect the client to carry a persisted client instance id or should
the client be treated as a new instance?

also some comments inline.

On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe  wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.
>
> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> +1 on RegisterClient or RegisterMetricClient


> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>
> > 6. > PushTelemetryRequest{
> >ClientInstanceId = f00d-feed-deff-ceff--….,
> >SubscriptionId = 0x234adf34,
> >ContentType = OTLPv8|ZSTD,
> >Terminating = False,
> >Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
>
If we assume connection is not bound to ClientInstanceId, and the RPC can
be sent to any broker (not necessarily the broker doing the registration).
The client-instance-id is required for every metric reporting. It's just
part of the labelling.


> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>
> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>
> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>
> > If the broker connection goes down or the connection is to be used for
> > other purposes (e.g., blocking FetchRequests), the 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-20 Thread Feng Min
Some comments about subscriptionId.

On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe  wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
>
ApiVersion is not a good example. API Version here is actually acting like
an identifier as the client will carry this information. Forcing to
disconnect a connection from the server side is quite heavy. IMHO, the
behavior is kind of part of the protocol. Adding subscriptionId is
relatively simple and straightforward.



> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.
>
> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>
> > 6. > PushTelemetryRequest{
> >ClientInstanceId = f00d-feed-deff-ceff--….,
> >SubscriptionId = 0x234adf34,
> >ContentType = OTLPv8|ZSTD,
> >Terminating = False,
> >Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>
> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>
> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>
> > If the broker connection goes down or the connection is to be used for
> > other purposes (e.g., blocking FetchRequests), the client will send
> > PushTelemetryRequests to any other broker in the cluster, using the same
> > ClientInstanceId and SubscriptionId as received in the latest
> > GetTelemetrySubscriptionsResponse.
> >
> > While the subscriptionId may change during the lifetime of the client
> > instance (when metric subscriptions are updated), the ClientInstanceId is
> > only acquired once and must not change (as it is used to identify the
> > unique client instance).
> > ...
> > What we do want though is ability to single out a specific client
> instance
> > to give it a more fine-grained subscription for 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-20 Thread Colin McCabe
On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> Thanks for your feedback Colin, see my updated proposal below.
> ...

Hi Magnus,

Thanks for the update.

> 
> Splitting up the API into separate data and control requests makes sense.
> With a split we would have one API for querying the broker for configured
> metrics subscriptions,
> and one API for pushing the collected metrics to the broker.
> 
> A mechanism is still needed to notify the client when the subscription is
> changed;
> I’ve added a SubscriptionId for this purpose (which could be a checksum of
> the configured metrics subscription), this id is sent to the client along
> with the metrics subscription, and the client sends it back to the broker
> when pushing metrics. If the broker finds the pushed subscription id to
> differ from what is expected it will return an error to the client, which
> triggers the client to retrieve the new subscribed metrics and an updated
> subscription id. The generation of the subscriptionId is opaque to the
> client.
>

Hmm, SubscriptionId seems rather complex. We don't have this kind of 
complicated machinery for changing ApiVersions, and that is something that can 
also change over time, and which affects the clients.

Changing the configured metrics should be extremely rare. In this case, why 
don't we just close all connections on the broker side? Then the clients can 
re-connect and re-fetch the information about the metrics they're supposed to 
send.
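A toy sketch of that alternative (hypothetical names; not what the KIP specifies): the broker drops all connections when the configured metrics change, and clients re-fetch the subscription on reconnect.

```python
class MetricsBroker:
    def __init__(self, subscribed_metrics):
        self.subscribed_metrics = list(subscribed_metrics)
        self.open_connections = set()

    def connect(self, client):
        self.open_connections.add(client)
        # On every (re)connect the client re-fetches the current subscription.
        client.subscription = list(self.subscribed_metrics)

    def update_subscription(self, new_metrics):
        self.subscribed_metrics = list(new_metrics)
        # Close every connection; clients learn of the change on reconnect.
        self.open_connections.clear()

class Client:
    def __init__(self):
        self.subscription = None

broker = MetricsBroker(["producer."])
c = Client()
broker.connect(c)
assert c.subscription == ["producer."]

broker.update_subscription(["producer.", "consumer."])
assert c not in broker.open_connections  # connection was dropped

broker.connect(c)  # reconnect picks up the new subscription
assert c.subscription == ["producer.", "consumer."]
```

The trade-off is that a (rare) configuration change disturbs all in-flight traffic on those connections, which is what motivates the lighter SubscriptionId check discussed elsewhere in the thread.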

> 
> Something like this:
> 
> // Get the configured metrics subscription.
> GetTelemetrySubscriptionsRequest {
>StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> newly generated instance id from the broker.
> }

It seems like the goal here is to have the client register itself, so that we 
can tell if this is an old client reconnecting. If that is the case, then I 
suggest to rename the RPC to RegisterClient.

I think we need a better name than "clientInstanceId" since that name is very 
similar to "clientId." Perhaps something like originId? Or clientUuid? Let's 
also use UUID here rather than a string.

> 6. > PushTelemetryRequest{
>ClientInstanceId = f00d-feed-deff-ceff--….,
>SubscriptionId = 0x234adf34,
>ContentType = OTLPv8|ZSTD,
>Terminating = False,
>Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>   }

It's not necessary for the client to re-send its client instance ID here, since 
it already registered with RegisterClient. If the TCP connection dropped, it 
will have to re-send RegisterClient anyway. SubscriptionID we should get rid 
of, as I said above.

I don't see the need for protobufs. Why not just use Kafka's own serialization 
mechanism? As much as possible, we should try to avoid creating "turduckens" of 
protocol X containing a buffer serialized with protocol Y, containing a 
protocol serialized with protocol Z. These aren't conducive to a good 
implementation, and make it harder for people to write clients. Just use 
Kafka's RPC protocol (with optional fields if you wish).

If we do compression on Kafka RPC, I would prefer that we do it a more generic 
way that applies to all control messages, not just this one. I also doubt we 
need to support lots and lots of different compression codecs, at first at 
least.

Another thing I'd like to understand is whether we truly need "terminating" (or 
anything like it). I'm still confused about how the backend could use this. 
Keep in mind that we may receive it on multiple brokers (or not receive it at 
all). We may receive more stuff about client XYZ from broker 1 after we have 
already received a "terminated" for client XYZ from broker 2.

> If the broker connection goes down or the connection is to be used for
> other purposes (e.g., blocking FetchRequests), the client will send
> PushTelemetryRequests to any other broker in the cluster, using the same
> ClientInstanceId and SubscriptionId as received in the latest
> GetTelemetrySubscriptionsResponse.
> 
> While the subscriptionId may change during the lifetime of the client
> instance (when metric subscriptions are updated), the ClientInstanceId is
> only acquired once and must not change (as it is used to identify the
> unique client instance).
> ...
> What we do want though is ability to single out a specific client instance
> to give it a more fine-grained subscription for troubleshooting, and
> we can do that with the current proposal with matching solely on the
> CLIENT_INSTANCE_ID.
> In other words; all clients will have the same standard metrics
> subscription, but specific client instances can have alternate
> subscriptions.

That makes sense, and gives a good reason why we might want to couple finding 
the metrics info to passing the client UUID.

> The metrics collector/tsdb/whatever will need to identify a single client
> instance, regardless of which broker received the metrics.
> The chapter on CLIENT_INSTANCE_ID motivates why we need a 

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-20 Thread Feng Min
LGTM in terms of the RPC separation and the new SubscriptionId to detect
target metric changes on the server side.

On Tue, Sep 14, 2021 at 12:48 AM Magnus Edenhill  wrote:

> Thanks for your feedback Colin, see my updated proposal below.
>
>
> Den tors 22 juli 2021 kl 03:17 skrev Colin McCabe :
>
> > On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> > > Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe  >:
> > > > A few critiques:
> > > >
> > > > - As I wrote above, I think this could benefit a lot by being split
> > into
> > > > several RPCs. A registration RPC, a report RPC, and an unregister RPC
> > seem
> > > > like logical choices.
> > > >
> > >
> > > Responded to this in your previous mail, but in short I think a single
> > > request is sufficient and keeps the implementation complexity / state
> > down.
> > >
> >
> > Hi Magnus,
> >
> > I still suspect that trying to do everything with a single RPC is more
> > complex than using multiple RPCs.
> >
> > Can you go into more detail about how the client learns what metrics it
> > should send? This was the purpose of the "registration" step in my scheme
> > above.
> >
> > It seems quite awkward to combine an RPC for reporting metrics with and
> > RPC for finding out what metrics are configured to be reported. For
> > example, how would you build a tool to check what metrics are configured
> to
> > be reported? Does the tool have to report fake metrics, just because
> > there's no other way to get back that information? Seems wrong. (It would
> > be a bit like combining createTopics and listTopics for "simplicity")
> >
>
>
>
> Splitting up the API into separate data and control requests makes sense.
> With a split we would have one API for querying the broker for configured
> metrics subscriptions,
> and one API for pushing the collected metrics to the broker.
>
> A mechanism is still needed to notify the client when the subscription is
> changed;
> I’ve added a SubscriptionId for this purpose (which could be a checksum of
> the configured metrics subscription), this id is sent to the client along
> with the metrics subscription, and the client sends it back to the broker
> when pushing metrics. If the broker finds the pushed subscription id to
> differ from what is expected it will return an error to the client, which
> triggers the client to retrieve the new subscribed metrics and an updated
> subscription id. The generation of the subscriptionId is opaque to the
> client.
>
>
> Something like this:
>
> // Get the configured metrics subscription.
> GetTelemetrySubscriptionsRequest {
>StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> newly generated instance id from the broker.
> }
>
> GetTelemetrySubscriptionsResponse {
>   Int16  ErrorCode
>   Int32  SubscriptionId   // This is used for comparison in
> PushTelemetryRequest. Could be a crc32 of the subscription.
>   Str    ClientInstanceId
>   Int8   AcceptedContentTypes
>   Array  SubscribedMetrics[] {
>   String MetricsPrefix
>   Int32  IntervalMs
>   }
> }
>
>
> The ContentType is a bitmask in this new proposal, high bits indicate
> compression:
>   0x01   OTLPv08
>   0x10   GZIP
>   0x40   ZSTD
>   0x80   LZ4
>
>
> // Push metrics
> PushTelemetryRequest {
>    Str    ClientInstanceId
>Int32  SubscriptionId// The collected metrics in this request are
> based on the subscription with this Id.
>Int8   ContentType   // E.g., OTLPv08|ZSTD
>Bool   Terminating
>Binary Metrics
> }
>
>
> PushTelemetryResponse {
>Int32 ThrottleTime
>Int16 ErrorCode
> }
>
>
> An example run:
>
> 1. Client instance starts, connects to broker.
> 2. > GetTelemetrySubscriptionsRequest{ ClientInstanceId=Null } // Requests
> an instance id and the subscribed metrics.
> 3. < GetTelemetrySubscriptionsResponse{
>   ErrorCode = 0,
>   SubscriptionId = 0x234adf34,
>   ClientInstanceId = f00d-feed-deff-ceff--…,
>   AcceptedContentTypes = OTLPv08|ZSTD|LZ4,
>   SubscribedMetrics[] = {
>  { “client.producer.tx.”, 6 },
>  { “client.memory.rss”, 90 },
>   }
>}
> 4. Client updates its metrics subscription, next push to fire in 60
> seconds.
> 5. 60 seconds passes
> 6. > PushTelemetryRequest{
>ClientInstanceId = f00d-feed-deff-ceff--….,
>SubscriptionId = 0x234adf34,
>    ContentType = OTLPv08|ZSTD,
>Terminating = False,
>Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>   }
> 7. < PushTelemetryResponse{ 0, NO_ERROR }
> 8. 60 seconds passes
> 9. > PushTelemetryRequest…
> …
> 56. The operator changes the configured metrics subscriptions (through
> Admin API).
> 57. > PushTelemetryRequest{ .. SubscriptionId = 0x234adf34 .. }
> 58. The subscriptionId no longer matches since the subscription has been
> updated, broker responds with an error:
> 59. < PushTelemetryResponse{ 0,   ERR_INVALID_SUBSCRIPTION_ID }
> 60. The error triggers the client to request the subscriptions again.

Re: [DISCUSS] KIP-714: Client metrics and observability

2021-09-14 Thread Magnus Edenhill
Thanks for your feedback Colin, see my updated proposal below.


Den tors 22 juli 2021 kl 03:17 skrev Colin McCabe :

> On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> > Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe :
> > > A few critiques:
> > >
> > > - As I wrote above, I think this could benefit a lot by being split
> into
> > > several RPCs. A registration RPC, a report RPC, and an unregister RPC
> seem
> > > like logical choices.
> > >
> >
> > Responded to this in your previous mail, but in short I think a single
> > request is sufficient and keeps the implementation complexity / state
> down.
> >
>
> Hi Magnus,
>
> I still suspect that trying to do everything with a single RPC is more
> complex than using multiple RPCs.
>
> Can you go into more detail about how the client learns what metrics it
> should send? This was the purpose of the "registration" step in my scheme
> above.
>
> It seems quite awkward to combine an RPC for reporting metrics with an
> RPC for finding out what metrics are configured to be reported. For
> example, how would you build a tool to check what metrics are configured to
> be reported? Does the tool have to report fake metrics, just because
> there's no other way to get back that information? Seems wrong. (It would
> be a bit like combining createTopics and listTopics for "simplicity")
>



Splitting up the API into separate data and control requests makes sense.
With a split we would have one API for querying the broker for configured
metrics subscriptions,
and one API for pushing the collected metrics to the broker.

A mechanism is still needed to notify the client when the subscription is
changed;
I’ve added a SubscriptionId for this purpose (which could be a checksum of
the configured metrics subscription), this id is sent to the client along
with the metrics subscription, and the client sends it back to the broker
when pushing metrics. If the broker finds the pushed subscription id to
differ from what is expected it will return an error to the client, which
triggers the client to retrieve the new subscribed metrics and an updated
subscription id. The generation of the subscriptionId is opaque to the
client.
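The suggestion above mentions that the SubscriptionId "could be a checksum of the configured metrics subscription". As a minimal sketch of that idea (the serialization format and CRC32 choice are assumptions for illustration, not part of the proposal), a broker could derive the id deterministically so that any change to the subscription yields a new id:

```python
# Hypothetical sketch: derive an opaque SubscriptionId as a CRC32 over the
# serialized subscription entries. Sorting makes the id independent of the
# order in which subscriptions are configured.
import zlib

def subscription_id(subscribed_metrics):
    """subscribed_metrics: list of (metrics_prefix, interval_ms) tuples."""
    payload = ";".join(f"{prefix}:{interval}"
                       for prefix, interval in sorted(subscribed_metrics))
    return zlib.crc32(payload.encode("utf-8"))

sub_v1 = [("client.producer.tx.", 60000), ("client.memory.rss", 900000)]
sub_v2 = [("client.producer.tx.", 30000)]
assert subscription_id(sub_v1) == subscription_id(sub_v1)  # deterministic
assert subscription_id(sub_v1) != subscription_id(sub_v2)  # change is detectable
```

Because the id is a pure function of the configured subscription, the broker never needs to store per-client state to detect a stale client, which is in line with keeping the generation opaque to the client.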


Something like this:

// Get the configured metrics subscription.
GetTelemetrySubscriptionsRequest {
   StrNull  ClientInstanceId  // Null on first invocation to retrieve a
newly generated instance id from the broker.
}

GetTelemetrySubscriptionsResponse {
  Int16  ErrorCode
  Int32  SubscriptionId   // This is used for comparison in
PushTelemetryRequest. Could be a crc32 of the subscription.
  Str    ClientInstanceId
  Int8   AcceptedContentTypes
  Array  SubscribedMetrics[] {
  String MetricsPrefix
  Int32  IntervalMs
  }
}


The ContentType is a bitmask in this new proposal, high bits indicate
compression:
  0x01   OTLPv08
  0x10   GZIP
  0x40   ZSTD
  0x80   LZ4
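Using the bit values listed above, a single ContentType byte encodes both the serialization format (low bits) and the compression codec (high bits). A quick sketch of how a client or broker might compose and inspect the mask (the mask constant is an assumption based on the stated low/high split):

```python
# ContentType bit values taken from the proposal above.
OTLP_V08 = 0x01
GZIP     = 0x10
ZSTD     = 0x40
LZ4      = 0x80

COMPRESSION_MASK = 0xF0  # assumed: high nibble holds the compression bits

content_type = OTLP_V08 | ZSTD               # e.g. "OTLPv08|ZSTD"
assert content_type & OTLP_V08               # format bit is set
assert content_type & COMPRESSION_MASK == ZSTD  # exactly one codec selected
assert content_type == 0x41
```

The same scheme lets AcceptedContentTypes in the response advertise several codecs at once (e.g. `OTLPv08|ZSTD|LZ4`), with the client picking one for each push.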


// Push metrics
PushTelemetryRequest {
   Str    ClientInstanceId
   Int32  SubscriptionId// The collected metrics in this request are
based on the subscription with this Id.
   Int8   ContentType   // E.g., OTLPv08|ZSTD
   Bool   Terminating
   Binary Metrics
}


PushTelemetryResponse {
   Int32 ThrottleTime
   Int16 ErrorCode
}


An example run:

1. Client instance starts, connects to broker.
2. > GetTelemetrySubscriptionsRequest{ ClientInstanceId=Null } // Requests
an instance id and the subscribed metrics.
3. < GetTelemetrySubscriptionsResponse{
  ErrorCode = 0,
  SubscriptionId = 0x234adf34,
  ClientInstanceId = f00d-feed-deff-ceff--…,
  AcceptedContentTypes = OTLPv08|ZSTD|LZ4,
  SubscribedMetrics[] = {
 { “client.producer.tx.”, 6 },
 { “client.memory.rss”, 90 },
  }
   }
4. Client updates its metrics subscription, next push to fire in 60 seconds.
5. 60 seconds passes
6. > PushTelemetryRequest{
   ClientInstanceId = f00d-feed-deff-ceff--….,
   SubscriptionId = 0x234adf34,
   ContentType = OTLPv08|ZSTD,
   Terminating = False,
   Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
  }
7. < PushTelemetryResponse{ 0, NO_ERROR }
8. 60 seconds passes
9. > PushTelemetryRequest…
…
56. The operator changes the configured metrics subscriptions (through
Admin API).
57. > PushTelemetryRequest{ .. SubscriptionId = 0x234adf34 .. }
58. The subscriptionId no longer matches since the subscription has been
updated, broker responds with an error:
59. < PushTelemetryResponse{ 0,   ERR_INVALID_SUBSCRIPTION_ID }
60. The error triggers the client to request the subscriptions again.
61. > GetTelemetrySubscriptionsRequest{..}
62. < GetTelemetrySubscriptionsResponse { .. SubscriptionId = 0x72211,
SubscribedMetrics[] = .. }
63. Client update its subscription and continues to push metrics
accordingly.
…
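Steps 56-63 above imply a simple client-side retry loop: on ERR_INVALID_SUBSCRIPTION_ID the client re-fetches its subscription before pushing again. A hedged sketch of that loop follows; the broker interaction is stubbed out as a dict, and the error-code values are illustrative rather than taken from the wire protocol:

```python
# Illustrative error codes (values are placeholders, not the real protocol ones).
NO_ERROR = 0
ERR_INVALID_SUBSCRIPTION_ID = 1

def push_telemetry(broker_state, subscription_id):
    """Stub for PushTelemetryRequest: broker rejects a stale SubscriptionId."""
    if subscription_id != broker_state["subscription_id"]:
        return ERR_INVALID_SUBSCRIPTION_ID
    return NO_ERROR

def push_with_resync(broker_state, client):
    """Push metrics; on a stale id, re-fetch the subscription and retry once."""
    err = push_telemetry(broker_state, client["subscription_id"])
    if err == ERR_INVALID_SUBSCRIPTION_ID:
        # Stub for GetTelemetrySubscriptionsRequest: pick up the new id
        # (and, in a real client, the new metrics list and intervals).
        client["subscription_id"] = broker_state["subscription_id"]
        err = push_telemetry(broker_state, client["subscription_id"])
    return err

broker = {"subscription_id": 0x234adf34}
client = {"subscription_id": 0x234adf34}
assert push_with_resync(broker, client) == NO_ERROR
broker["subscription_id"] = 0x72211   # operator changed the subscriptions
assert push_with_resync(broker, client) == NO_ERROR
assert client["subscription_id"] == 0x72211
```

The single retry keeps the client stateless beyond its cached subscription: a mismatch is always resolved by one extra round trip, matching the flow in the example run.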


If the broker connection goes down or the connection is to be used for
other purposes (e.g., blocking FetchRequests), the client will send
PushTelemetryRequests to any other broker in the cluster, 
