Hi Peter and David,
Thanks for the KIP.

It contains some interesting ideas but it's very nebulous at this point. I 
think the suggestion of a proxy
layer in front of Kafka is a good way to start probing at this. Are the 
existing Kafka metrics helpful
for QoS? What changes might we make to the Kafka protocol to enable this kind 
of information to
flow back to the proxy and/or clients, a kind of KIP-714 in reverse? Of course, 
you need someone to
write suitable code in a proxy to take this beyond a paper exercise.

There are also multiple implementations of the Kafka protocol these days, 
including the proxies
themselves. When you're talking to a "broker", it's not necessarily a broker at 
all. The apparent number
of brokers in the cluster you're talking to might not reflect the reality of 
the deployed resources.
As a result, I suggest that it's better to think about the QoS offered by the 
cluster as a whole, and to treat clusters as brokerless services.

One of the future enhancements that I have in mind for share groups when we've 
completed
KIP-932 is to change consumer assignments dynamically based on partition load. 
We have already
loosened the link between partitions and consumers (at the expense of ordering, 
to be sure).
I would also like to be able to have the number of consumers change dynamically 
in appropriate
environments to scale as the workload ebbs and flows.
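To make that concrete, here is a toy sketch of what load-based scaling of a share group's consumer count could look like. The lag metric, target, and bounds are all invented for illustration; nothing like this is specified in KIP-932.

```python
def desired_consumer_count(partition_lags: list[int],
                           target_lag_per_consumer: int,
                           min_consumers: int = 1,
                           max_consumers: int = 100) -> int:
    """Scale the share-group consumer count with total backlog:
    provision enough consumers that each handles at most
    target_lag_per_consumer records of lag."""
    total_lag = sum(partition_lags)
    needed = -(-total_lag // target_lag_per_consumer)  # ceiling division
    # clamp to the bounds the environment allows
    return max(min_consumers, min(max_consumers, needed))
```

A scaling controller could re-evaluate this periodically and grow or shrink the group as the workload ebbs and flows.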

Thanks,
Andrew
________________________________________
From: David Kjerrumgaard <dav...@apache.org>
Sent: 13 May 2025 18:52
To: dev@kafka.apache.org <dev@kafka.apache.org>
Subject: Re: KIP-1182 Quality of Service (QoS) for Apache Kafka

Thanks for the feedback, Almog. I agree that the level of effort here will 
require several related KIPs.

For the first phase, I envision a proxy layer that sits in front of multiple 
Kafka clusters, e.g. one traditional deployment and another diskless 
implementation. Based on the QoS requested by the client, the proxy routes the 
client to the best cluster for that task. Cluster expansion (where possible) 
would also be in scope for this phase: if the proxy determines that all of the 
clusters are overloaded, it can choose to expand an existing one by adding more 
brokers, or dynamically create a net-new cluster to accommodate the anticipated 
load.

Phase 2 would focus on tracking cluster and topic performance against the 
stated QoS metrics, likely starting with alerts on compliance or 
non-compliance with the agreed-upon SLAs. Prolonged violation of an SLA would 
trigger consumer/producer negotiation.
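As an illustration of the "prolonged violation" trigger, here is a minimal sketch that treats N consecutive out-of-SLA intervals as the negotiation threshold; the metric (p99 latency) and thresholds are placeholders:

```python
class SlaMonitor:
    """Track recent measurements against a stated SLA and escalate:
    one bad interval raises an alert, a streak of bad intervals
    triggers producer/consumer negotiation."""

    def __init__(self, sla_p99_ms: float, consecutive_limit: int = 3):
        self.sla_p99_ms = sla_p99_ms
        self.consecutive_limit = consecutive_limit
        self.bad_streak = 0

    def observe(self, p99_ms: float) -> str:
        if p99_ms > self.sla_p99_ms:
            self.bad_streak += 1
        else:
            self.bad_streak = 0  # a compliant interval resets the streak
        if self.bad_streak >= self.consecutive_limit:
            return "NEGOTIATE"  # prolonged violation
        if self.bad_streak > 0:
            return "ALERT"      # transient violation, alert only
        return "OK"
```

Whether a reset-on-compliance streak or a sliding-window violation ratio is the right definition of "prolonged" is exactly the kind of detail the phase-2 KIP would need to pin down.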

In a later phase we can focus on the negotiation between producers and 
consumers. This would most likely require dynamic reassignment of topics to 
clusters, e.g. shifting a topic from a diskless cluster to a disk-based one to 
accommodate a consumer's lower-latency requirement.

On 2025/05/13 15:26:01 Almog Gavra wrote:
> Thanks for the KIP Peter! Curious to see where this one goes, I think it's
> good to start a discussion around this though perhaps we'll need to split
> it up into more focused improvements as there's a lot bundled in this one
> idea!
>
> A0. I'd like to see some folks who are more familiar with the broker
> implementation chime in on the feasibility of implementing some of
> this. AFAIK, there are no capabilities that allow (for example) shifting
> resources between topics. Isolating that from a resource allocation
> perspective may be a huge lift, though certainly a valuable one.
>
> A1. With A0 in mind, I'm wondering what the benefit of making the QoS spec
> an open standard is - it depends heavily both on the broker implementation
> and on how it's deployed (containerized? bare metal? k8s?). That makes what
> we can practically offer bundled with the default implementation limited.
> OTOH, I'm not sure whether users benefit from "open standards, free of
> vendor bias as much as possible" if the specification is customizable
> enough to allow for vendor-specific extensions.
>
> A2. More of a technical note, but the dynamic negotiation between producer
> and consumer seems to break a key abstraction of Kafka, which is decoupling
> producers from consumers. That might work well if you have one consumer,
> but if you have multiple I imagine you wouldn't want one lagging to cause
> the producer to back up.
>
> I'll be following along, I'm sure there will be some good discussions
> around this!
>
> - Almog
>
> On Mon, May 12, 2025 at 4:47 PM Peter Corless
> <peter.corl...@startree.ai.invalid> wrote:
>
> > David Kjerrumgaard and I wrote up the following KIP for Kafka Quality of
> > Service (QoS). It would be a mechanism to describe desired behaviors and
> > actual capabilities of producers, clusters and consumers, and to allow them
> > to negotiate desired throughputs, latencies, data retention, and other
> > elements of data streaming. It would also provide instrumentality for
> > observability to measure actual performance to compare to desired
> > performance.
> >
> > Would love to hear frank and thoughtful feedback, as well as committers who
> > would be interested in working on implementation.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1182%3A+Quality+of+Service+%28QoS%29+Framework
> >
> > --
> >
> > Peter Corless
> > Director of Product Marketing
> > 650-906-3134
>
