Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Divij Vaidya Mon, 30 Oct 2023 03:26:13 -0700

Hey *Qichao*

Thank you for the update on the KIP. I like the idea of incremental
delivery and adding which metrics support this verbosity as a later KIP.
But I also want to ensure that we wouldn't have to change the current
config when adding that in future. Hence, we need some discussion on it in
the scope of the KIP.


About the dynamic configuration:
Do we need to add the "default" mode? I am asking because it may inhibit us
from adding the allowList option in future. Instead if we could rephrase
the config as: "metric.verbosity.high" which takes values as a regEx
(default will be empty), then we wouldn't have to worry about
future-proofness of this KIP. Notably this is an existing pattern used by
KIP-544.
Alternatively, if you choose to stick to the current configuration pattern,
please provide information on how this config will look like when we add
allow listing in future.

About the perf test:
Motivation - The motivation of perf test is to provide users with a hint on
what perf penalty they can expect and whether default has degraded perf
(due to additional "empty" labels).
Dimensions of the test could be - scrape interval, utilization of broker
(no traffic vs. heavy traffic), number of partitions (small/200 to
large/2k).
Things to collect during perf test - number of mbeans registered with JMX,
CPU, heap utilization
Expected results - As long as we can prove that there is no additional
usage (significant) of CPU or heap after this change for the "default
mode", we should be good. For the "high" mode, we should document the
expected increase for users but it is not a blocker to implement this KIP.


*Kirk*, I have tried to clarify the expectation on performance, does that
address your question earlier? Also, I am happy with having a Kafka level
dynamic config that we can use to filter our metric/dimensionality since we
have a precedence at KIP-544. Hence, my suggestion to push this filtering
to metric library can be ignored.

--
Divij Vaidya



On Sat, Oct 28, 2023 at 11:37 AM Qichao Chu <qic...@uber.com.invalid> wrote:

> Hello Everyone,
>
> Can I ask for some feedback regarding KIP-977
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
> >
> ?
>
> Best,
> Qichao Chu
> Software Engineer | Data - Kafka
> [image: Uber] <https://uber.com/>
>
>
> On Mon, Oct 16, 2023 at 7:34 PM Qichao Chu <qic...@uber.com> wrote:
>
> > Hi Divij and Kirk,
> >
> > Thank you both for providing the valuable feedback and sorry for the
> > delay. I have just updated the KIP to address the comments.
> >
> >    1. Instead of using a topic-level control, global verbosity control
> >    makes more sense if we want to extend it in the future. It would be
> very
> >    difficult if we want to apply the topic allowlist everywhere
> >    2. Also, the topic allowlist was not dynamic which makes everything
> >    quite complex, especially for the topic lifecycle management. By
> using the
> >    dynamic global config, debugging could be easier, and management of
> the
> >    config is also made easier.
> >    3. More details are included in the test section.
> >
> > One thing that still misses is the performance numbers. I will get it
> > ready with our internal clusters and share out soon.
> >
> > Many thanks for the review!
> > Qichao
> >
> > On Tue, Sep 12, 2023 at 8:31 AM Kirk True <k...@kirktrue.pro> wrote:
> >
> >> Oh, and does metrics.partition.level.reporting.topics allow for regex?
> >>
> >> > On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote:
> >> >
> >> > Hi Qichao,
> >> >
> >> > Thanks for the KIP!
> >> >
> >> > Divij—questions/comments inline...
> >> >
> >> >> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com>
> >> wrote:
> >> >>
> >> >> Thank you for the proposal Qichao.
> >> >>
> >> >> I agree with the motivation here and understand the tradeoff here
> >> >> between observability vs. increased metric dimensions (metric fan-out
> >> >> as you say in the KIP).
> >> >>
> >> >> High level comments:
> >> >>
> >> >> 1. I would urge you to consider the extensibility of the proposal for
> >> >> other types of metrics. Tomorrow, if we want to selectively add
> >> >> "partition" dimension to another metric, would we have to modify the
> >> >> code where each metric is emitted? Alternatively, could we abstract
> >> >> out this config in a "Kafka Metrics" library. The code provides all
> >> >> information about this library and this library can choose which
> >> >> dimensions it wants to add to the final metrics that are emitted
> based
> >> >> on declarative configuration.
> >> >
> >> > I’d agree with this if it doesn’t place a burden on the callers. Are
> >> there any potential call sites that don’t have the partition information
> >> readily available?
> >> >
> >> >> 2. Can we offload the handling of this dimension filtering to the
> >> >> metric framework? Have you explored whether prometheus or other
> >> >> libraries provide the ability to dynamically change dimensions
> >> >> associated with metrics?
> >> >
> >> > I’m not familiar with the downstream metrics providers’ capabilities.
> >> This is a greatest common denominator scenario, right? We’d have to be
> >> reasonable sure that the heavily used providers *all* support such
> dynamic
> >> filtering.
> >> >
> >> > Also—and correct me as needed as I’m not familiar with the area—if we
> >> relegate partition filtering to a lower layer, we’d still need to store
> the
> >> metric data in memory until it’s flushed, yes? If so, is that overhead
> of
> >> any concern?
> >> >
> >> >> Implementation level comments:
> >> >>
> >> >> 1. In the test plan section, please mention what kind of integ and/or
> >> >> unit tests will be added and what they will assert. As an example,
> you
> >> >> can add a section, "functionality tests", which would assert that new
> >> >> metric config is being respected and another section, "performance
> >> >> tests", which could be a system test and assert that no regression
> >> >> caused wrt resources occupied by metrics from one version to another.
> >> >> 2. Please mention why or why not are we considering dynamically
> >> >> setting the configuration (i.e. without a broker restart)? I would
> >> >> imagine that the ability to dynamically configure for a specific
> topic
> >> >> will be very useful especially to debug production situations that
> you
> >> >> mention in the motivation.
> >> >
> >> > +1
> >> >
> >> >> 3. You mention that we want to start with metrics closely related to
> >> >> producer & consumers first, which is fair. Could you please add a
> >> >> statement on the work required to extend this to other metrics in
> >> >> future?
> >> >
> >> > +1
> >> >
> >> >> 4. In the compatibility section, you mention that this change is
> >> >> backward compatible. I don't fully understand that. During a version
> >> >> upgrade, we will start with an empty list of topics to maintain
> >> >> backward compatibility. I assume after the upgrade, we will update
> the
> >> >> new config with topic names that we desire to monitor. But updating
> >> >> the config will require a broker restart (a rolling restart since
> >> >> config is read-only). We will be in a situation where some brokers
> are
> >> >> sending metrics with a new "partition" dimension and some brokers are
> >> >> sending metrics with no partition dimension. Is that acceptable to
> JMX
> >> >> / prometheus collectors? Would it break them? Please clarify how
> >> >> upgrades will work in the compatibility section.
> >> >> 5. Could you please quantify (with an experiment) the expected perf
> >> >> impact of adding the partition dimension? This could be done as part
> >> >> of "test plan" section and would serve as a data point for users to
> >> >> understand the potential impact if they decide to turn it on.
> >> >
> >> > Is there some guidance on the level of precision and detail expected
> >> when providing the performance numbers in the KIP?
> >> >
> >> > This notion of proving out the performance impact is important, I
> >> agree. Anecdotally, there was another KIP I was following for which
> >> performance numbers were requested, as is reasonable. But that caused
> the
> >> KIP to go a bit sideways as a result because it wasn’t able to get
> >> consensus on a) the different scenarios to test, and b) the quantitative
> >> goal for each. I’m not really sure the rigo(u)r that’s expected at this
> >> stage in the development of a new feature.
> >> >
> >> > Thanks,
> >> > Kirk
> >> >
> >> >>
> >> >> --
> >> >> Divij Vaidya
> >> >>
> >> >>
> >> >> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid>
> >> wrote:
> >> >>>
> >> >>> Hi All,
> >> >>>
> >> >>> Although this has been discussed many times, I would like to start a
> >> new
> >> >>> discussion regarding the introduction of partition-level throughput
> >> >>> metrics. Please review the KIP and I'm eager to know everyone's
> >> thoughts:
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
> >> >>>
> >> >>> TL;DR: The KIP proposes to add partition-level throughput metrics
> and
> >> a new
> >> >>> configuration to control the fan-out rate.
> >> >>>
> >> >>> Thank you all for the review and have a nice weekend!
> >> >>>
> >> >>> Best,
> >> >>> Qichao
> >>
> >>
>

Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Reply via email to