Answers inline:

On Wed, May 14, 2025 at 2:42 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Peter and David,
> Thanks for the KIP.
>
> It contains some interesting ideas but it's very nebulous at this point.


True. However, we believed that until we actually put a proposal out there
for others to bat around, it could remain a nebulous idea until the end of
time. Our hope is that by sharing it we'll get the feedback to take it out
of its current nebulous, gaseous state into something more fluid, and
eventually solid.


> I think the suggestion of a proxy
> layer in front of Kafka is a good way to start probing at this. Are the
> existing Kafka metrics helpful
> for QoS?


Yes. I can well imagine that some metrics will be helpful as-is. A gap
analysis is needed: which of these can we use, which can we modify or extend
for our purposes, and which need to be invented from whole cloth?


> What changes might we make to the Kafka protocol to enable this kind of
> information to
> flow back to the proxy and/or clients, kind of KIP-714 in reverse? Of
> course, you need someone to
> write suitable code in a proxy to take this beyond a paper exercise.
>

This is a great point. In due time we need to drag a fine-toothed comb
through all of the extant instrumentality and telemetry types and see how
we can reuse specific metrics and line them up with QoS requests.


> There are also multiple implementations of the Kafka protocol these days,
> including the proxies
> themselves. When you're talking to a "broker", it's not necessarily a
> broker at all. The apparent number
> of brokers in the cluster you're talking to might not reflect the reality
> of the deployed resources.
> As a result, I suggest that it's better to think about the QoS offered by
> the cluster and consider them
> as brokerless services.
>

100%. There are serverless and server-based implementations. When you have
visibility into specific brokers and servers, you might be able to make
certain requests that you cannot make of a serverless implementation.

Generally I keep coming back to the idea that what we propose should not
*preclude* specification down to the broker/server level if you have access
to it. Yet I believe you're correct: generally we should abstract to the
cluster level because that might be all that we have to go on.


> One of the future enhancements that I have in mind for share groups when
> we've completed
> KIP-932 is to change consumer assignments dynamically based on partition
> load. We already have
> loosened the link between partitions and consumers (at the expense of
> ordering, to be sure).
> I would also like to be able to have the number of consumers change
> dynamically in appropriate
> environments to scale as the workload ebbs and flows.
>

This is exactly the kind of thing that is good to know up front as we go
forward from concept towards implementation. Had we come forth with a
fully fleshed-out design and only belatedly learned about this, I'm sure
there would have been a few "Whoops!" uttered.


> Thanks,
> Andrew
>

Thanks greatly for your consideration, Andrew. You've given me quite a bit
to think about, and a few new sidequests to undertake.

-Peter.


> ________________________________________
> From: David Kjerrumgaard <dav...@apache.org>
> Sent: 13 May 2025 18:52
> To: dev@kafka.apache.org <dev@kafka.apache.org>
> Subject: Re: KIP-1182 Quality of Service (QoS) for Apache Kafka
>
> Thanks for the feedback Almog. I agree that the level of effort for this
> requires several different KIPs that are all related.
>
> For the first phase, I envision a proxy layer that sits in front of
> multiple Kafka clusters, e.g. one traditional deployment, and another
> diskless implementation. Then based on the requested QoS by the client, the
> proxy will route the client to the best cluster for that task. As part of
> this first phase, cluster expansion (if possible) would be in scope as
> well.  Thus, if the proxy determines that all of the clusters are
> overloaded, it can choose to expand an existing one by adding more brokers,
> or create a net new cluster dynamically to accommodate the anticipated load.
>
> Phase 2 would focus on tracking the cluster and topic performance against
> the stated QoS performance metrics, likely starting with alerts based on
> compliance or non-compliance with the agreed-upon SLAs. Prolonged violation
> of the SLA would trigger consumer/producer negotiation.
>
> In a later phase we can focus on the negotiation between producer and
> consumers. This would most likely require dynamic reassignment of topics to
> clusters, e.g. shifting a topic from a diskless cluster to a disk-based one
> to accommodate a lower latency requirement by a consumer.
>
> On 2025/05/13 15:26:01 Almog Gavra wrote:
> > Thanks for the KIP Peter! Curious to see where this one goes, I think
> > it's good to start a discussion around this though perhaps we'll need to
> > split it up into more focused improvements as there's a lot bundled in
> > this one idea!
> >
> > A0. I'd like to see some folks who are more familiar with the broker
> > implementation chime in on the feasibility of implementing some of
> > this. AFAIK, there are no capabilities that allow (for example) shifting
> > resources between topics. Isolating that from a resource allocation
> > perspective may be a huge lift, though certainly a valuable one.
> >
> > A1. With A0 in mind, I'm wondering what the benefit is of making the QoS
> > spec an open standard - it depends heavily both on the broker
> > implementation and on how it's deployed (containerized? bare metal? k8s?).
> > That limits what we can practically offer bundled with the default
> > implementation. OTOH, I'm not sure whether users benefit from "open
> > standards, free of vendor bias as much as possible" if the specification
> > is customizable enough to allow for vendor-specific extensions.
> >
> > A2. More a technical note, but the dynamic negotiation between producer
> > and consumer seems to break a key abstraction of Kafka, which is
> > decoupling producers from consumers. That might work well if you have one
> > consumer, but if you have multiple I imagine you wouldn't want one lagging
> > to cause the producer to back up.
> >
> > I'll be following along, I'm sure there will be some good discussions
> > around this!
> >
> > - Almog
> >
> > On Mon, May 12, 2025 at 4:47 PM Peter Corless
> > <peter.corl...@startree.ai.invalid> wrote:
> >
> > > David Kjerrumgaard and I wrote up the following KIP for Kafka Quality of
> > > Service (QoS). It would be a mechanism to describe desired behaviors and
> > > actual capabilities of producers, clusters and consumers, and to allow
> > > them to negotiate desired throughputs, latencies, data retention, and
> > > other elements of data streaming. It would also provide instrumentality
> > > for observability to measure actual performance to compare to desired
> > > performance.
> > >
> > > Would love to hear frank and thoughtful feedback, and to hear from
> > > committers who would be interested in working on implementation.
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1182%3A+Quality+of+Service+%28QoS%29+Framework
> > >
> > > --
> > >
> > > [image: StarTree] <https://startree.ai/>
> > > Peter Corless
> > > Director of Product Marketing
> > > 650-906-3134
> > > Follow us: [image: LinkedIn] <
> https://www.linkedin.com/in/petercorless/
> > > >[image:
> > > Twitter] <https://twitter.com/petercorless>[image: Slack]
> > > <https://stree.ai/slack>[image: YouTube]
> > > <https://youtube.com/StarTreeData>[image:
> > > Calendly] <https://calendly.com/peter-corless/30min>
> > >
> > > [image: Save my spot for Real-Time Analytics Summit 2025]
> > > <
> > >
> https://rtasummit.startree.ai/?utm_source=referral&utm_medium=email&utm_campaign=signature
> > > >
> > >
> >
>

