Hey Jun, I've clarified the scope of the standard metrics in the KIP, but basically:
* We define a standard set of generic metrics that should be relevant to
  most client implementations, e.g., each producer implementation probably
  has some sort of per-partition message queue.
* A client implementation should strive to implement as many of the
  standard metrics as possible, but only the ones that make sense.
* For metrics that are not in the standard set, a client maintainer can
  choose to either submit a KIP to add additional standard metrics (if
  they're relevant), or go ahead and add custom metrics that are specific
  to that client implementation. These custom metrics will have a prefix
  specific to that client implementation, as opposed to the standard
  metric set that resides under "org.apache.kafka...". E.g.,
  "se.edenhill.librdkafka" or whatever (a sketch of this naming split
  follows below).
* Existing non-KIP-714 metrics should remain untouched. In some cases we
  might be able to use the same meter, given that it is compatible with
  the standard metric set definition; in other cases a semi-duplicate
  meter may be needed. Thus this will not affect the metrics exposed
  through JMX, or vice versa.
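For illustration, a minimal sketch of that naming split using the Java
client's Metrics registry. The registration calls are the existing API,
but the metric names, group, and zero-valued gauges are hypothetical
stand-ins, not names from the KIP:

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Metrics;

    public class MetricNamespaceSketch {
        public static void main(String[] args) {
            try (Metrics metrics = new Metrics()) {
                // Standard metric: meant to surface under the shared
                // "org.apache.kafka..." telemetry namespace when exported.
                MetricName standard = metrics.metricName(
                    "record.queue.bytes", "producer-telemetry",
                    "Total bytes of currently queued records (standard set)");
                // Client-specific metric: lives under the implementation's
                // own prefix, e.g. "se.edenhill.librdkafka...".
                MetricName custom = metrics.metricName(
                    "custom.queue.wakeups", "producer-telemetry",
                    "Hypothetical implementation-specific counter");
                metrics.addMetric(standard, (config, now) -> 0.0);
                metrics.addMetric(custom, (config, now) -> 0.0);
            }
        }
    }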
Thanks,
Magnus

On Wed, Jun 1, 2022 at 18:55, Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, Magnus,
>
> 51. Just to clarify my question. (1) Are standard metrics required for
> every client for this KIP to function? (2) Are we converting existing
> java metrics to the standard metrics and deprecating the old ones? If
> so, could we list all existing java metrics that need to be renamed and
> the corresponding new names?
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 3:29 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 51. I think it's fine to have a list of recommended metrics for every
> > client to implement. I am just not sure that standardizing on the
> > metric names across all clients is practical. The list of common
> > metrics in the KIP has completely different names from the java metric
> > names. Some of them have different types. For example, some of the
> > common metrics have a type of histogram, but the java client metrics
> > don't use histograms in general. Requiring the operator to translate
> > those names and understand the subtle differences across clients seems
> > likely to cause more confusion during troubleshooting.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <mag...@edenhill.se>
> > wrote:
> >
> > > On Fri, May 20, 2022 at 01:23, Jun Rao <j...@confluent.io.invalid>
> > > wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 50. Sounds good.
> > > >
> > > > 51. I misunderstood the proposal in the KIP then. The proposal is
> > > > to define a set of common metric names that every client should
> > > > implement. The problem is that every client already has its own
> > > > set of metrics with its own names. I am not sure that we could
> > > > easily agree upon a common set of metrics that works with all
> > > > clients. There are likely to be some metrics that are client
> > > > specific. Translating between the common name and the
> > > > client-specific name is probably going to add more confusion. As
> > > > mentioned in the KIP, similar metrics from different clients could
> > > > have subtle semantic differences. Could we just let each client
> > > > use its own set of metric names?
> > >
> > > We identified a common set of metrics that should be relevant for
> > > most client implementations; they're the ones listed in the KIP.
> > > A supporting client does not have to implement all of those metrics,
> > > only the ones that make sense for that client implementation, and a
> > > client may implement other metrics that are not listed in the KIP
> > > under its own namespace.
> > > This approach has two benefits:
> > > - there will be a common set of metrics that most/all clients
> > >   implement, which makes monitoring and troubleshooting easier
> > >   across fleets with multiple Kafka client
> > >   languages/implementations.
> > > - client-specific metrics are still possible, so if there is no
> > >   suitable standard metric a client can still provide whatever
> > >   special metrics it has.
> > >
> > > Thanks,
> > > Magnus
> > >
> > > On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill
> > > <mag...@edenhill.se> wrote:
> > >
> > > > On Wed, May 18, 2022 at 19:57, Jun Rao <j...@confluent.io.invalid>
> > > > wrote:
> > > >
> > > > > Hi, Magnus,
> > > >
> > > > Hi Jun
> > > >
> > > > > Thanks for the updated KIP. Just a couple more comments.
> > > > >
> > > > > 50. To troubleshoot a particular client issue, I imagine that
> > > > > the client needs to identify its client_instance_id. How does
> > > > > the client find this out? Do we plan to include
> > > > > client_instance_id in the client log, expose it as a metric, or
> > > > > something else?
> > > >
> > > > The KIP suggests that client implementations emit an informative
> > > > log message with the assigned client-instance-id once it is
> > > > retrieved (once per client instance lifetime).
> > > > There's also a clientInstanceId() method that an application can
> > > > use to retrieve the client instance id and emit through whatever
> > > > side channels make sense.
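> > > > A minimal sketch of that, assuming the clientInstanceId(Duration)
> > > > accessor lands on the Java producer as the KIP proposes (the
> > > > bootstrap address and serializer configs are placeholders):
> > > >
> > > >     import java.time.Duration;
> > > >     import java.util.Properties;
> > > >     import org.apache.kafka.clients.producer.KafkaProducer;
> > > >     import org.apache.kafka.common.Uuid;
> > > >
> > > >     public class ClientInstanceIdSketch {
> > > >         public static void main(String[] args) {
> > > >             Properties props = new Properties();
> > > >             props.put("bootstrap.servers", "localhost:9092");
> > > >             props.put("key.serializer",
> > > >                 "org.apache.kafka.common.serialization.StringSerializer");
> > > >             props.put("value.serializer",
> > > >                 "org.apache.kafka.common.serialization.StringSerializer");
> > > >             try (KafkaProducer<String, String> producer =
> > > >                      new KafkaProducer<>(props)) {
> > > >                 // Blocks up to the timeout while the id is fetched
> > > >                 // from the broker; the informative log line the KIP
> > > >                 // suggests would carry the same id.
> > > >                 Uuid id = producer.clientInstanceId(Duration.ofSeconds(5));
> > > >                 System.out.println("client-instance-id: " + id);
> > > >             }
> > > >         }
> > > >     }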
> > > > > 51. The KIP lists a bunch of metrics that need to be collected
> > > > > at the client side. However, it seems quite a few useful java
> > > > > client metrics like the following are missing.
> > > > > buffer-total-bytes
> > > > > buffer-available-bytes
> > > >
> > > > These are covered by client.producer.record.queue.bytes and
> > > > client.producer.record.queue.max.bytes.
> > > >
> > > > > bufferpool-wait-time
> > > >
> > > > Missing, but somewhat implementation specific.
> > > > If it was up to me we would add this later if there's a need.
> > > >
> > > > > batch-size-avg
> > > > > batch-size-max
> > > >
> > > > These are missing and would be suitably represented as histograms.
> > > > I'll add them.
> > > >
> > > > > io-wait-ratio
> > > > > io-ratio
> > > >
> > > > There's client.io.wait.time, which should cover io-wait-ratio.
> > > > We could add a client.io.time as well, now or in a later KIP.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <j...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi, Xavier,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 28. It does seem that we have started using KafkaMetrics on
> > > > > > the broker side. Then, my only concern is the usage of
> > > > > > Histogram in KafkaMetrics. Histogram in KafkaMetrics
> > > > > > statically divides the value space into a fixed number of
> > > > > > buckets and only returns values on the bucket boundaries. So,
> > > > > > the returned histogram value may never show up in a recorded
> > > > > > value. Yammer Histogram, on the other hand, uses reservoir
> > > > > > sampling: the reported value is always one of the recorded
> > > > > > values. So, I am not sure that Histogram in KafkaMetrics is as
> > > > > > good as Yammer Histogram. ClientMetricsPluginExportTime uses
> > > > > > Histogram.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > <xav...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > > 28. On the broker, we typically use Yammer metrics. Only
> > > > > > > > for metrics that depend on Kafka metric features (e.g.,
> > > > > > > > quota) do we use the Kafka metric. Yammer metrics have 4
> > > > > > > > types: gauge, meter, histogram and timer. meter calculates
> > > > > > > > a rate, but also exposes an accumulated value.
> > > > > > >
> > > > > > > I don't see a good reason we should limit ourselves to
> > > > > > > Yammer metrics on the broker. KafkaMetrics was written to
> > > > > > > replace Yammer metrics and is used for all new components
> > > > > > > (clients, streams, connect, etc.).
> > > > > > > My understanding is that the original goal was to retire
> > > > > > > Yammer metrics in the broker in favor of KafkaMetrics. We
> > > > > > > just haven't done so out of backwards-compatibility
> > > > > > > concerns.
> > > > > > > There are other broker metrics, such as the group
> > > > > > > coordinator, transaction state manager, and various socket
> > > > > > > server metrics, already using KafkaMetrics that don't need
> > > > > > > specific Kafka metric features, so I don't see why we should
> > > > > > > refrain from using Kafka metrics on the broker unless there
> > > > > > > are real compatibility concerns or where implementation
> > > > > > > specifics could lead to confusion when comparing metrics
> > > > > > > using different implementations.
> > > > > > >
> > > > > > > In my opinion we should encourage people to use KafkaMetrics
> > > > > > > going forward on the broker as well, for three reasons:
> > > > > > > a) yammer metrics is long deprecated and no longer
> > > > > > > maintained
> > > > > > > b) yammer metrics are much less expressive
> > > > > > > c) we don't have a proper API to expose yammer metrics
> > > > > > > outside of JMX (MetricsReporter only exposes KafkaMetrics)
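Point (c) refers to the pluggable reporter interface on the KafkaMetrics
side. A minimal sketch of such a reporter, with a stdout sink purely for
illustration (the class name is hypothetical; a real reporter would be
wired in via the metric.reporters configuration):

    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.common.metrics.KafkaMetric;
    import org.apache.kafka.common.metrics.MetricsReporter;

    public class LoggingMetricsReporter implements MetricsReporter {
        @Override
        public void configure(Map<String, ?> configs) {
            // No configuration needed for this sketch.
        }

        @Override
        public void init(List<KafkaMetric> metrics) {
            // Called once with all metrics registered before this reporter.
            metrics.forEach(m ->
                System.out.println("registered: " + m.metricName()));
        }

        @Override
        public void metricChange(KafkaMetric metric) {
            // Called when a metric is added or updated afterwards.
            System.out.println("changed: " + metric.metricName());
        }

        @Override
        public void metricRemoval(KafkaMetric metric) {
            System.out.println("removed: " + metric.metricName());
        }

        @Override
        public void close() {
            // Nothing to release.
        }
    }

Yammer metrics have no equivalent hook, which is why they are effectively
JMX-only today.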