Hi Peter and David,

Thanks for the KIP. It contains some interesting ideas, but it's very nebulous at this point. I think the suggestion of a proxy layer in front of Kafka is a good way to start probing at this. Are the existing Kafka metrics helpful for QoS? What changes might we make to the Kafka protocol to enable this kind of information to flow back to the proxy and/or clients, a kind of KIP-714 in reverse? Of course, you need someone to write suitable code in a proxy to take this beyond a paper exercise.
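For example, a proxy could already approximate some of this today by sampling the broker's existing metrics over JMX. Here is a minimal sketch, assuming the broker has remote JMX enabled (e.g. started with JMX_PORT=9999); the MBean and attribute names are the standard broker metrics, but the host name and the BrokerLoadProbe class are purely illustrative:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative sketch only: sample two existing broker metrics that a
// QoS-aware proxy might use as load/latency signals.
public class BrokerLoadProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX remotely, e.g. started with JMX_PORT=9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Ingest rate on the broker: one-minute moving average, bytes/sec.
            Object bytesInPerSec = mbsc.getAttribute(
                    new ObjectName("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec"),
                    "OneMinuteRate");

            // Broker-side produce latency: 99th percentile of total request time, ms.
            Object produceTotalTimeP99 = mbsc.getAttribute(
                    new ObjectName("kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"),
                    "99thPercentile");

            System.out.printf("bytesIn/s=%s, produce p99=%s ms%n",
                    bytesInPerSec, produceTotalTimeP99);
        }
    }
}

Polling JMX from outside the cluster is clunky, of course, which is part of why a protocol-level mechanism is worth exploring.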
There are also multiple implementations of the Kafka protocol these days, including the proxies themselves. When you're talking to a "broker", it's not necessarily a broker at all, and the apparent number of brokers in the cluster might not reflect the reality of the deployed resources. As a result, I suggest it's better to think about the QoS offered by the cluster as a whole and to treat clusters essentially as brokerless services.

One of the future enhancements I have in mind for share groups, once we've completed KIP-932, is to change consumer assignments dynamically based on partition load. We have already loosened the link between partitions and consumers (at the expense of ordering, to be sure). I would also like the number of consumers to be able to change dynamically in appropriate environments, scaling as the workload ebbs and flows.

Thanks,
Andrew
________________________________________
From: David Kjerrumgaard <dav...@apache.org>
Sent: 13 May 2025 18:52
To: dev@kafka.apache.org <dev@kafka.apache.org>
Subject: Re: KIP-1182 Quality of Service (QoS) for Apache Kafka

Thanks for the feedback, Almog. I agree that the level of effort here means this will require several related KIPs.

For the first phase, I envision a proxy layer that sits in front of multiple Kafka clusters, e.g. one traditional deployment and one diskless implementation. Based on the QoS requested by the client, the proxy would route the client to the best cluster for that task. As part of this first phase, cluster expansion (where possible) would be in scope as well: if the proxy determines that all of the clusters are overloaded, it can choose to expand an existing one by adding more brokers, or dynamically create a net-new cluster to accommodate the anticipated load.

Phase 2 would focus on tracking cluster and topic performance against the stated QoS metrics, most likely starting with alerts on compliance or non-compliance with the agreed-upon SLAs. Prolonged violation of an SLA would trigger consumer/producer negotiation.

In a later phase we can focus on the negotiation between producers and consumers. This would most likely require dynamic reassignment of topics to clusters, e.g. shifting a topic from a diskless cluster to a disk-based one to accommodate a consumer's lower latency requirement.

On 2025/05/13 15:26:01 Almog Gavra wrote:
> Thanks for the KIP, Peter! Curious to see where this one goes. I think
> it's good to start a discussion around this, though perhaps we'll need
> to split it up into more focused improvements, as there's a lot bundled
> in this one idea!
>
> A0. I'd like to see some folks who are more familiar with the broker
> implementation chime in on the feasibility of implementing some of this.
> AFAIK, there are no capabilities that allow (for example) shifting
> resources between topics. Isolating that from a resource allocation
> perspective may be a huge lift, though certainly a valuable one.
>
> A1. With A0 in mind, I'm wondering what the benefit is of making the
> QoS spec an open standard - it depends heavily both on the broker
> implementation and on how it's deployed (containerized? bare metal?
> k8s?). That makes what we can practically offer bundled with the
> default implementation limited. OTOH, I'm not sure whether users
> benefit from "open standards, free of vendor bias as much as possible"
> if the specification is customizable enough to allow for vendor-specific
> extensions.
>
> A2. More a technical note, but the dynamic negotiation between producer
> and consumer seems to break a key abstraction of Kafka, which is the
> decoupling of producers from consumers. That might work well if you
> have one consumer, but if you have multiple I imagine you wouldn't want
> one lagging consumer to cause the producer to back up.
>
> I'll be following along; I'm sure there will be some good discussions
> around this!
>
> - Almog
>
> On Mon, May 12, 2025 at 4:47 PM Peter Corless
> <peter.corl...@startree.ai.invalid> wrote:
>
> > David Kjerrumgaard and I wrote up the following KIP for Kafka Quality
> > of Service (QoS). It would be a mechanism to describe the desired
> > behaviors and actual capabilities of producers, clusters and
> > consumers, and to allow them to negotiate desired throughputs,
> > latencies, data retention, and other elements of data streaming. It
> > would also provide instrumentation for observability, to measure
> > actual performance against desired performance.
> >
> > Would love to hear frank and thoughtful feedback, as well as from
> > committers who would be interested in working on the implementation.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1182%3A+Quality+of+Service+%28QoS%29+Framework
> >
> > --
> > Peter Corless
> > Director of Product Marketing, StarTree <https://startree.ai/>
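P.S. To make David's phase-1 routing idea a bit more concrete, here is a minimal sketch of the kind of decision a QoS-aware proxy might take. The QosSpec and ClusterStats types, their fields, and the selection rule are purely illustrative assumptions of mine, not anything the KIP defines:

import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: a requested QoS and a naive proxy-side cluster selection.
record QosSpec(Duration maxProduceLatencyP99, long minThroughputBytesPerSec, Duration retention) {}

record ClusterStats(String clusterId, Duration observedProduceLatencyP99, long spareThroughputBytesPerSec) {}

final class QosRouter {
    // Pick the cluster with the most headroom that currently satisfies the requested QoS, if any.
    static Optional<ClusterStats> route(QosSpec spec, List<ClusterStats> clusters) {
        return clusters.stream()
                .filter(c -> c.observedProduceLatencyP99().compareTo(spec.maxProduceLatencyP99()) <= 0)
                .filter(c -> c.spareThroughputBytesPerSec() >= spec.minThroughputBytesPerSec())
                .max(Comparator.comparingLong(ClusterStats::spareThroughputBytesPerSec));
    }
}

The interesting questions are where the ClusterStats numbers would come from (existing broker metrics, or something the cluster pushes to the proxy) and what happens when route() finds no suitable cluster, which is where the expand-or-create-a-cluster step David describes would kick in.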