Re: [DISCUSS] KIP-714: Client metrics and observability

Jun Rao Tue, 21 Jun 2022 17:24:34 -0700

Hi, Magnus, Kirk,

Thanks for the reply. A few more comments on your reply.


100. I agree there are some benefits of having a set of standard metrics
across all clients, but I am just wondering how practical it is, given that
the proposal doesn't require this set like the Kafka protocol.
100.1 A client may not implement all or some of the standard metrics. Then,
we won't have complete standardized names across clients.
100.2 The set of standard metrics needs to be common across all clients.
For example, client.consumer.poll.latency implies that all clients
implement a poll() interface. Is that true for all clients?
client.producer.record.queue.bytes. Do all producers have queues? We
probably need to make a pass of those metrics to see if they are indeed
common across all clients. Also, a bunch of standard metrics have type
Histogram. Java client doesn't have good Histogram support yet. I am also
not sure if all clients support Histogram. Should we avoid Histogram type
in standardized metrics?
100.3 For a subset of metrics that are truly common across clients, it
would be confusing for each client to maintain two sets of metrics for the
same thing. We could document them, but that means every user of every
client needs to remember this mapping. This is a much bigger
inconvenience than standardizing the metric names on the server side. If we
want to go this route, my preference is to deprecate the existing metric
names that are covered by the standard metric names.

101. "or if the client-specific metrics are not converted to some common
form, name, semantic, etc, it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics." There will always be client specific metrics.
So, it seems that we have to deal with scattered custom metrics even with a
set of standard metrics.

102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
is changed to "connections.open.count." At this point, there are two names
and machine-to-machine communication will likely be effected. With that
change, all client telemetry plugin(s) used in an organization must be
updated to reflect that change, else data loss or bugs could be
introduced." The standard metric names could change too in the future,
right? So, we need to deal with a similar problem if that happens.

103. "Are there any inobvious security/privacy-related edge cases where
shipping certain metrics to the broker would be "bad?"" I am not sure. But
if a metric can be shipped to the server, it would be useful for the same
metric to be visible on the client side.

Thanks,

Jun


On Tue, Jun 21, 2022 at 8:19 AM Kirk True <[email protected]> wrote:

> Hi Jun,
>
> Thank you for all your continued interest in shaping the KIP :)
>
> On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > Hi, Kirk,
> >
> > Thanks for the reply. A couple of more comments.
> >
> > (1) "Another perspective is that these two sets of metrics serve
> different
> > purposes and/or have different audiences, which argues that they should
> > maintain their individuality and purpose. " Hmm, I am wondering if those
> > metrics are really for different audiences and purposes? For example, if
> > the operator detected an issue through a client metric collected through
> > the server, the operator may need to communicate that back to the client.
> > It would be weird if that same metric is not visible on the client side.
>
> I agree in the principal that all client metrics visible on the client can
> also be available to be sent to the broker.
>
> Are there any inobvious security/privacy-related edge cases where shipping
> certain metrics to the broker would be "bad?"
>
> > (2) If we could standardize the names on the server side, do we need to
> > enforce a naming convention for all clients?
>
> "Enforce" is such an ugly word :P
>
> But yes, I do feel that a consistent naming convention across all clients
> provides communication benefits between two entities:
>
>  1. Human-to-human communication. Ecosystem-wide agreement and
> understanding of metrics helps all to communicate more efficiently.
>  2. Machine-to-machine communication. Defining the names via the KIP
> mechanism help to ensure stabilization across releases of a given client.
>
> Point 1: Human-to-human Communication
>
> There are quite a handful of parties that must communicate effectively
> across the Kafka ecosystem. Here are the ones I can think of off the top of
> my head:
>
>  1. Kafka client authors
>  2. Kafka client users
>  3. Kafka client telemetry plugin authors
>  4. Support teams (within an organization or vendor-supplied across
> organizations)
>  5. Kafka cluster operators
>
> There should be a standard so that these parties can understand the
> metrics' meaning and be able to correlate that across all clients.
>
> As a concrete example, KIP-714 includes a metric for tracking the number
> of active client connections to a cluster, named
> "org.apache.kafka.client.connection.active." Given this name, all client
> implementations can communicate this name and its value to all parties
> consistently. Without a standard naming convention, the metric might be
> named "connections.open" in the Java client and "Connections/Alive" in
> librdkafka. This inconsistency of naming would impact the discussions
> between one or more of the parties involved.
>
> To your point, it's absolutely a design choice to keep the naming
> convention the same between each client. We can change that if it makes
> sense.
>
> Point 2: Machine-to-machine Communication
>
> Standardization at the client level provides stability through an implied
> contract that a client should not introduce a breaking name change between
> releases. Otherwise, the ability for the metrics to be "understood" in a
> machine-to-machine context would be forfeit.
>
> For example, let's say that we give the clients the latitude to name
> metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> release decides to name this metric "connections.open." It's a good name!
> It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> the metric name is changed to "connections.open.count." At this point,
> there are two names and machine-to-machine communication will likely be
> effected. With that change, all client telemetry plugin(s) used in an
> organization must be updated to reflect that change, else data loss or bugs
> could be introduced.
>
> That the KIP defines the names of the metrics does, admittedly, constrain
> the options of authors of the different clients. The metric named
> "org.apache.kafka.client.connection.active" may be confusing in some client
> implementations. For whatever reason, a client author may even find it
> "undesirable" to include a reference that includes "Apache" in their code.
>
> There's also the precedent set by the existing (JMX-based) client metrics.
> Though these are applicable only to the Java client, we can see that having
> a standardized naming convention there has helped with communication.
>
> So, IMO, it makes sense to define the metric names via the KIP mechanism
> and--let's say, "ask"--that client implementations abide by those.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <[email protected]> wrote:
> >
> > > Hi Jun,
> > >
> > > I'll try to answer the questions posed...
> > >
> > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > So, the standard set of generic metrics is just a recommendation and
> not
> > > a
> > > > requirement? This sounds good to me since it makes the adoption of
> the
> > > KIP
> > > > easier.
> > >
> > > I believe that was the intent, yes.
> > >
> > > > Regarding the metric names, I have two concerns.
> > >
> > > (I'm splitting these two up for readability...)
> > >
> > > > (1) If a client already
> > > > has an existing metric similar to the standard one, duplicating the
> > > metric
> > > > seems to be confusing.
> > >
> > > Agreed. I'm dealing with that situation as I write the Java client
> > > implementation.
> > >
> > > The existing Java client exposes a set of metrics via JMX. The updated
> > > Java client will introduce a second set of metrics, which instead are
> > > exposed via sending them to the broker. There is substantial overlap
> with
> > > the two set of metrics and in a few places in the code under
> development,
> > > there are essentially two separate calls to update metrics: one for the
> > > JMX-bound metrics and one for the broker-bound metrics.
> > >
> > > To be candid, I have gone back-and-forth on that design. From one
> > > perspective, it could be argued that the set of client metrics should
> be
> > > standardized across a given client, regardless of how those metrics are
> > > exposed for consumption. Another perspective is that these two sets of
> > > metrics serve different purposes and/or have different audiences, which
> > > argues that they should maintain their individuality and purpose. Your
> > > inputs/suggestions are certainly welcome!
> > >
> > > > (2) If a client needs to implement a standard metric
> > > > that doesn't exist yet, using a naming convention (e.g., using dash
> vs
> > > dot)
> > > > different from other existing metrics also seems a bit confusing. It
> > > seems
> > > > that the main benefit of having standard metric names across clients
> is
> > > for
> > > > better server side monitoring. Could we do the standardization in the
> > > > plugin on the server?
> > >
> > > I think the expectation is that the plugin implementation will perform
> > > transformation of metric names, if needed, to fit in with an
> organization's
> > > monitoring naming standards. Perhaps we need to call that out in the
> KIP
> > > itself.
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <[email protected]>
> > > wrote:
> > > >
> > > > > Hey Jun,
> > > > >
> > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > basically:
> > > > >
> > > > >  * We define a standard set of generic metrics that should be
> relevant
> > > to
> > > > > most client implementations, e.g., each producer implementation
> > > probably
> > > > > has some sort of per-partition message queue.
> > > > >  * A client implementation should strive to implement as many of
> the
> > > > > standard metrics as possible, but only the ones that make sense.
> > > > >  * For metrics that are not in the standard set, a client
> maintainer
> > > can
> > > > > choose to either submit a KIP to add additional standard metrics -
> if
> > > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > > to
> > > > > that client implementation. These custom metrics will have a prefix
> > > > > specific to that client implementation, as opposed to the standard
> > > metric
> > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > "se.edenhill.librdkafka" or whatever.
> > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > > we
> > > > > might be able to use the same meter given it is compatible with the
> > > > > standard metric set definition, in other cases a semi-duplicate
> meter
> > > may
> > > > > be needed. Thus this will not affect the metrics exposed through
> JMX,
> > > or
> > > > > vice versa.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao
> <[email protected]>:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> required
> > > for
> > > > > > every client for this KIP to function?  (2) Are we converting
> > > existing
> > > > > java
> > > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > > could
> > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > corresponding new name?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <[email protected]>
> wrote:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > every
> > > > > > > client to implement. I am just not sure that standardizing on
> the
> > > > > metric
> > > > > > > names across all clients is practical. The list of common
> metrics
> > > in
> > > > > the
> > > > > > > KIP have completely different names from the java metric names.
> > > Some of
> > > > > > > them have different types. For example, some of the common
> metrics
> > > > > have a
> > > > > > > type of histogram, but the java client metrics don't use
> histogram
> > > in
> > > > > > > general. Requiring the operator to translate those names and
> > > understand
> > > > > > the
> > > > > > > subtle differences across clients seem to cause more confusion
> > > during
> > > > > > > troubleshooting.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > > <[email protected]
> > > > > >:
> > > > > > >>
> > > > > > >> > Hi, Magus,
> > > > > > >> >
> > > > > > >> > Thanks for the reply.
> > > > > > >> >
> > > > > > >> > 50. Sounds good.
> > > > > > >> >
> > > > > > >> > 51. I miss-understood the proposal in the KIP then. The
> > > proposal is
> > > > > to
> > > > > > >> > define a set of common metric names that every client should
> > > > > > implement.
> > > > > > >> The
> > > > > > >> > problem is that every client already has its own set of
> metrics
> > > with
> > > > > > its
> > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > common
> > > > > set
> > > > > > of
> > > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > > metrics
> > > > > > >> > that are client specific. Translating between the common
> name
> > > and
> > > > > > client
> > > > > > >> > specific name is probably going to add more confusion. As
> > > mentioned
> > > > > in
> > > > > > >> the
> > > > > > >> > KIP, similar metrics from different clients could have
> subtle
> > > > > > >> > semantic differences. Could we just let each client use its
> own
> > > set
> > > > > of
> > > > > > >> > metric names?
> > > > > > >> >
> > > > > > >>
> > > > > > >> We identified a common set of metrics that should be relevant
> for
> > > most
> > > > > > >> client implementations,
> > > > > > >> they're the ones listed in the KIP.
> > > > > > >> A supporting client does not have to implement all those
> metrics,
> > > only
> > > > > > the
> > > > > > >> ones that makes sense
> > > > > > >> based on that client implementation, and a client may
> implement
> > > other
> > > > > > >> metrics that are not listed
> > > > > > >> in the KIP under its own namespace.
> > > > > > >> This approach has two benefits:
> > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > implement,
> > > > > > >> which makes monitoring
> > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > client
> > > > > > >> languages/implementations.
> > > > > > >>  - client-specific metrics are still possible, so if there is
> no
> > > > > > suitable
> > > > > > >> standard metric a client can still
> > > > > > >>    provide what special metrics it has.
> > > > > > >>
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Magnus
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > [email protected]>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > > <[email protected]
> > > > > > >> >:
> > > > > > >> > >
> > > > > > >> > > > Hi, Magnus,
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Hi Jun
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > > >> > > >
> > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > that
> > > > > the
> > > > > > >> > client
> > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > client
> > > > > find
> > > > > > >> this
> > > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > > log,
> > > > > > >> expose
> > > > > > >> > > it
> > > > > > >> > > > as a metric or something else?
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > The KIP suggests that client implementations emit an
> > > informative
> > > > > log
> > > > > > >> > > message
> > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > (once
> > > > > per
> > > > > > >> > client
> > > > > > >> > > instance lifetime).
> > > > > > >> > > There's also a clientInstanceId() method that an
> application
> > > can
> > > > > use
> > > > > > >> to
> > > > > > >> > > retrieve
> > > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > > makes
> > > > > > >> > sense.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > collected
> > > > > at
> > > > > > >> the
> > > > > > >> > > > client side. However, it seems quite a few useful java
> > > client
> > > > > > >> metrics
> > > > > > >> > > like
> > > > > > >> > > > the following are missing.
> > > > > > >> > > >     buffer-total-bytes
> > > > > > >> > > >     buffer-available-bytes
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are covered by client.producer.record.queue.bytes
> and
> > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     bufferpool-wait-time
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     batch-size-avg
> > > > > > >> > > >     batch-size-max
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are missing and would be suitably represented as a
> > > > > histogram.
> > > > > > >> I'll
> > > > > > >> > > add them.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     io-wait-ratio
> > > > > > >> > > >     io-ratio
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > There's client.io.wait.time which should cover
> io-wait-ratio.
> > > > > > >> > > We could add a client.io.time as well, now or in a later
> KIP.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Magnus
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks,
> > > > > > >> > > >
> > > > > > >> > > > Jun
> > > > > > >> > > >
> > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> [email protected]>
> > > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi, Xavier,
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for the reply.
> > > > > > >> > > > >
> > > > > > >> > > > > 28. It does seem that we have started using
> KafkaMetrics
> > > on
> > > > > the
> > > > > > >> > broker
> > > > > > >> > > > > side. Then, my only concern is on the usage of
> Histogram
> > > in
> > > > > > >> > > KafkaMetrics.
> > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > space
> > > > > > into
> > > > > > >> a
> > > > > > >> > > fixed
> > > > > > >> > > > > number of buckets and only returns values on the
> bucket
> > > > > > boundary.
> > > > > > >> So,
> > > > > > >> > > the
> > > > > > >> > > > > returned histogram value may never show up in a
> recorded
> > > > > value.
> > > > > > >> > Yammer
> > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> sampling. The
> > > > > > >> reported
> > > > > > >> > > value
> > > > > > >> > > > > is always one of the recorded values. So, I am not
> sure
> > > that
> > > > > > >> > Histogram
> > > > > > >> > > in
> > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > >> > > > > uses Histogram.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Jun
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > >> > > > <[email protected]>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > Only
> > > > > for
> > > > > > >> > metrics
> > > > > > >> > > > >> that
> > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> use
> > > the
> > > > > > Kafka
> > > > > > >> > > > metric.
> > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> histogram
> > > and
> > > > > > timer.
> > > > > > >> > > meter
> > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > value.
> > > > > > >> > > > >> >
> > > > > > >> > > > >>
> > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> to
> > > Yammer
> > > > > > >> > metrics
> > > > > > >> > > on
> > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > components
> > > > > > >> > (clients,
> > > > > > >> > > > >> streams, connect, etc.)
> > > > > > >> > > > >> My understanding is that the original goal was to
> retire
> > > > > Yammer
> > > > > > >> > > metrics
> > > > > > >> > > > in
> > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > >> > > > >> We just haven't done so out of backwards
> compatibility
> > > > > > concerns.
> > > > > > >> > > > >> There are other broker metrics such as group
> coordinator,
> > > > > > >> > transaction
> > > > > > >> > > > >> state
> > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> Kafka
> > > > > > metric
> > > > > > >> > > > features,
> > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > compatibility
> > > > > > >> > > concerns
> > > > > > >> > > > >> or
> > > > > > >> > > > >> where implementation specifics could lead to
> confusion
> > > when
> > > > > > >> > comparing
> > > > > > >> > > > >> metrics using different implementations.
> > > > > > >> > > > >>
> > > > > > >> > > > >> In my opinion we should encourage people to use
> > > KafkaMetrics
> > > > > > >> going
> > > > > > >> > > > forward
> > > > > > >> > > > >> on the broker as well, for two reasons:
> > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > maintained
> > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> metrics
> > > > > outside
> > > > > > of
> > > > > > >> > JMX
> > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > >> > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to