Hi Greg,

This is my first reply to an apache mailing list, here's hoping it works as
expected.

My employer and I are looking forward to seeing Diskless land in Kafka and
see it as the natural evolution of tiered storage into a more full-featured
"cloud-native" Kafka. The incorporation of the TS remote storage providers
and the alignment with the Iceberg format capabilities in the 2.0 KIP are
especially exciting.

However, there is some worry that the timeline will slip toward 2+ years,
as we experienced with TS. Back then, recommending primarily open-source
Kafka solutions (like Aiven or vanilla Confluent) to our customers became
quite difficult as proprietary alternatives started providing similar
functionality (n.b. essentially only hyperscalers came up as providers -
think Azure Event Hubs or MSK). Hopefully this anecdote will help ensure
that resources can be allocated internally.

Kind Regards,
Roger Siegenthaler

On 2025/11/13 18:51:02 Greg Harris wrote:
> Hi all,
>
> There was a video call between myself, Ivan Yurchenko, Jun Rao, and Andrew
> Schofield pertaining to KIP-1150. Here are the notes from that meeting:
>
> Ivan: What is the future state of Kafka in this area, in 5 years?
> Jun: Do we want something more cloud native? Yes, we started with Tiered
> Storage. If there’s a better way, we should explore it. In the long term
> this will be useful.
> Because Kafka is used so widely, we need to make sure everything we add is
> for the long term and for everyone, not just for a single company.
> When we added TS, it didn’t just solve Uber’s use-case. We want something
> that’s high quality/lasts/maintainable, and can work with all existing
> capabilities.
> If both 1150 and 1176 proceed at the same time, it’s confusing. They
> overlap, but Diskless is more ambitious.
> If both KIPs are being seriously worked on, then we don’t really need
> both, because Diskless clearly is better. Having multiple will confuse
> people. It will duplicate some of the effort.
> If we want diskless ultimately, what is the short term strategy, to get
> some early wins first?
> Ivan: Andrew, do you want a more revolutionary approach?
> Andrew: Eventually the architecture will change substantially; it may not
> be necessary to put all of that bill onto Diskless at once.
> Greg: We all agree on having a high quality feature merged upstream, and
> supporting all APIs
> Jun: We should try and keep things simple, but there is some minimum
> complexity needed.
> When doing the short-term changes (1176), we don’t really make progress
> toward a more modern architecture.
> Greg: Was TS+Compaction the only feature miss we’ve had so far?
> Jun: The danger of only applying changes to some part of the API is that
> you set the precedent that you only have to implement part of the API.
> Supporting the full API set should be a minimum requirement.
> Andrew: When we started KRaft, how much of the design did we know?
> Jun: For KRaft we didn’t really know much about the migration, but the
> high-level design was clear.
> Greg: Is 1150 votable in its current state?
> Jun: 1150 should promise to support all APIs. It doesn’t have to have all
> the details/APIs/etc. KIP-500 didn’t have them.
> We do need enough high-level design to give confidence that the promise
> can be fulfilled.
> Greg: Is the draft version in 1163 enough detail or is more needed?
> Jun: We need to agree on the core design, such as leaderless etc. And how
> the APIs will be supported.
> Greg: Okay we can include these things, and provide a sketch of how the
> other leader-based features operate.
> Jun: Yeah if at a high level the sketch appears to work, we can approve
> that functionality.
> Are you committed to doing the more involved and big project?
> Greg: Yes, we’re committed to the 1163 design and can’t really accept 1176.
> Jun: TS was slow because of Uber resourcing problems
> Greg: We’ll push internally for resources, and use the community sentiment
> to motivate Aiven.
> How far into the future should we look? What sort of scale?
> Jun: As long as there’s a path forward, and we’re not closing off future
> improvements, we can figure out how to handle a larger scale when it
> arises.
> Greg: Random replica placement is very harmful, can we recommend users to
> use an external tool like CruiseControl?
> Jun: Not everyone uses CruiseControl, we would probably need some solution
> for this out of the box
> Ivan: Should the Batch Coordinator be pluggable?
> Jun: Out-of-box experience should be good, good to allow other
> implementations
> Greg: But it could hurt Kafka feature/upgrade velocity when we wait for
> plugin providers to implement it
> Ivan: We imagined that maybe cloud hyperscalers could implement it with
> e.g. DynamoDB
> Greg: Could we bake more details of the different providers into Kafka, or
> does it still make sense for it to be pluggable?
> Jun: Make it whatever is easiest to roll out and add new clients
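>
> To make the pluggability discussion concrete: such a plugin might boil
> down to an interface along these lines (purely illustrative; the names
> are invented here and nothing like this exists in Kafka today):
>
>   // Hypothetical plugin surface for assigning offsets to batches that
>   // already sit in object storage, and for finding them again on fetch.
>   public interface BatchCoordinator extends AutoCloseable {
>       record BatchCoordinates(String objectKey, long byteOffset,
>                               int byteLength, long baseOffset,
>                               int recordCount) {}
>
>       // Durably assign the next offsets to an uploaded batch and return
>       // the assigned base offset.
>       long commitBatch(String topic, int partition, BatchCoordinates c);
>
>       // Resolve which objects hold the requested offset range so a broker
>       // can serve the fetch from object storage.
>       java.util.List<BatchCoordinates> findBatches(String topic,
>               int partition, long fetchOffset, int maxBatches);
>   }
>
> A DynamoDB-backed implementation, as Ivan alludes to above, would then be
> a matter of implementing this interface against conditional writes.
>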
> Andrew: What happens next? Do you want to get KIP-1150 voted?
> Ivan: The vote is already open, we’re not too pressed for time. We’ll go
> improve the 1163 design and communication.
> Is 1176 a competing design? Someone will ask.
> Jun: If we are seriously working on something more ambitious, yeah we
> shouldn’t do the stop-gap solution.
> It’s diverting review resources. If we can get the short-term thing in 1yr
> but the Diskless solution takes 2yr, it makes sense to go for Diskless. If
> it’s 5yr, that’s different and maybe the stop-gap solution is needed.
> Greg: I’m biased but I believe we’re in the 1yr/2yr case. Should we
> explicitly exclude 1176?
> Andrew: Put your arms around the feature set you actually want, and use
> that to rule out 1176.
> Probably don’t need -1 votes, most likely KIPs just don’t receive votes.
> Ivan: Should we have sync meetings like tiered storage did?
> Jun: Satish posted meeting notes regularly, we should do the same.
>
> To summarize, we will be polishing the contents of 1150 & the high-level
> design in 1163 to prepare for a vote.
> We believe that the community should select the feature set of 1150 to
> fully eliminate producer cross-zone costs, and make the investment in a
> high quality Diskless Topics implementation rather than in stop-gap
> solutions.
>
> Thanks,
> Greg
>
> On Fri, Nov 7, 2025 at 9:19 PM Max fortun <[email protected]> wrote:
>
> > This may be a tangent, but we needed to offload storage off of Kafka into
> > S3. We are keeping Kafka not as a source of truth, but as a mostly
> > ephemeral broker that can come and go as it pleases, be that scaling or
> > outage. Disks can be destroyed and recreated at will; we still retain data
> > and use the broker for just that, brokering messages. Not only that, we
> > reduced the requirement on the actual Kafka resources by reducing the size
> > of a payload via a claim check pattern. Maybe this is an anti-pattern, but
> > it is super fast and highly cost efficient. We reworked ProducerRequest to
> > allow plugins. We added a custom HTTP plugin that submits every request
> > via a persistent connection to a microservice. The microservice stores the
> > payload and returns a tiny JSON metadata object, a claim check, that can
> > be used to find the actual data. Think of it as zipping the payload. This
> > claim check metadata traverses the pipelines, with consumers using the
> > URLs in the metadata to pull what they need. Think unzipping. This allowed
> > us to also pull ONLY the data that we need in a GraphQL-like manner. So if
> > you have a 100K JSON payload and you need only a subsection, you can pull
> > that by JMESPath. When you have multiple consumer groups yanking down huge
> > payloads it is cumbersome on the broker. When you have the same consumer
> > groups yanking down a claim check, and then going out of band directly to
> > the source of truth, the broker has some breathing room. Obviously our
> > microservice does not go directly to the cloud storage, as that would be
> > too slow. It stores the payload in a high speed memory cache and returns
> > ASAP. That memory is eventually persisted into S3. Retrieval goes against
> > the cache first, then against S3. Overall a rather cheap and zippy
> > solution. I tried proposing a KIP for this, but there was no excitement.
> > Check this out:
> >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=318606528
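> >
> > For readers who want the shape of it: the same pattern can be
> > approximated with a stock ProducerInterceptor on the client side (a
> > hedged sketch only; the service URL and JSON shape below are invented,
> > and our real implementation hooks in at the ProducerRequest level
> > instead):
> >
> >   import java.net.URI;
> >   import java.net.http.HttpClient;
> >   import java.net.http.HttpRequest;
> >   import java.net.http.HttpResponse;
> >   import java.util.Map;
> >   import org.apache.kafka.clients.producer.ProducerInterceptor;
> >   import org.apache.kafka.clients.producer.ProducerRecord;
> >   import org.apache.kafka.clients.producer.RecordMetadata;
> >
> >   public class ClaimCheckInterceptor
> >           implements ProducerInterceptor<String, byte[]> {
> >       private final HttpClient http = HttpClient.newHttpClient();
> >
> >       @Override
> >       public ProducerRecord<String, byte[]> onSend(
> >               ProducerRecord<String, byte[]> record) {
> >           try {
> >               // Store the large payload out of band...
> >               HttpResponse<String> resp = http.send(
> >                   HttpRequest.newBuilder(URI.create("http://claims/store"))
> >                       .POST(HttpRequest.BodyPublishers
> >                           .ofByteArray(record.value()))
> >                       .build(),
> >                   HttpResponse.BodyHandlers.ofString());
> >               // ...and forward only the tiny claim-check JSON
> >               // (e.g. {"url": "...", "size": 102400}) through Kafka.
> >               return new ProducerRecord<>(record.topic(), record.key(),
> >                       resp.body().getBytes());
> >           } catch (Exception e) {
> >               throw new RuntimeException("claim-check store failed", e);
> >           }
> >       }
> >
> >       @Override public void onAcknowledgement(RecordMetadata m, Exception e) {}
> >       @Override public void close() {}
> >       @Override public void configure(Map<String, ?> configs) {}
> >   }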
> >
> >
> > > On Nov 7, 2025, at 5:49 PM, Jun Rao <[email protected]>
> > > wrote:
> > >
> > > Hi, Andrew,
> > >
> > > If we want to focus only on reducing cross-zone replication costs, there
> > > is an alternative design in the KIP-1176 discussion thread that seems
> > > simpler than the proposal here. I am copying the outline of that approach
> > > below.
> > >
> > > 1. A new leader is elected.
> > > 2. Leader maintains a first tiered offset, which is initialized to log
> > end
> > > offset.
> > > 3. Leader writes produced data from the client to local log.
> > > 4. Leader uploads produced data from all local logs as a combined object.
> > > 5. Leader stores the metadata for the combined object in memory.
> > > 6. If a follower fetch request has an offset >= first tiered offset, the
> > > metadata for the corresponding combined object is returned. Otherwise,
> > > the local data is returned.
> > > 7. Leader periodically advances first tiered offset.
> > >
> > > It's still a bit unnatural, but it could work.
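> > >
> > > To make step 6 concrete, a minimal sketch of the leader-side state and
> > > fetch decision (all names invented here for illustration; this is not
> > > from either KIP):
> > >
> > >   import java.util.Map;
> > >   import java.util.concurrent.ConcurrentNavigableMap;
> > >   import java.util.concurrent.ConcurrentSkipListMap;
> > >
> > >   // Steps 2 and 5: per-partition state kept by the leader.
> > >   final class FirstTieredOffsetState {
> > >       // Initialized to the log end offset on leader election (step 2),
> > >       // periodically advanced by the leader (step 7).
> > >       private volatile long firstTieredOffset;
> > >       // In-memory metadata (step 5): combined-object base offset -> key.
> > >       private final ConcurrentNavigableMap<Long, String> combinedObjects =
> > >           new ConcurrentSkipListMap<>();
> > >
> > >       FirstTieredOffsetState(long logEndOffset) {
> > >           this.firstTieredOffset = logEndOffset;
> > >       }
> > >
> > >       // Step 6: a follower fetch at or beyond the first tiered offset
> > >       // is answered with combined-object metadata; older offsets are
> > >       // served from the local log.
> > >       String describeFetch(long fetchOffset) {
> > >           if (fetchOffset >= firstTieredOffset) {
> > >               Map.Entry<Long, String> e =
> > >                   combinedObjects.floorEntry(fetchOffset);
> > >               return e == null ? "no combined object yet"
> > >                                : "fetch from object " + e.getValue();
> > >           }
> > >           return "read local log at offset " + fetchOffset;
> > >       }
> > >   }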
> > >
> > > Hi, Ivan,
> > >
> > > Are you still committed to proceeding with the original design of
> > > KIP-1150?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Sun, Nov 2, 2025 at 6:00 AM Andrew Schofield <
> > [email protected]>
> > > wrote:
> > >
> > >> Hi,
> > >> I’ve been following KIP-1150 and friends for a while. I’m going to
> > >> jump
> > >> into the discussions too.
> > >>
> > >> Looking back at Jack Vanlightly’s message, I am not quite so convinced
> > >> that it’s a kind of fork in the road. The primary aim of the effort is
> > >> to reduce cross-zone replication costs so Apache Kafka is not
> > >> prohibitively expensive to use on cloud storage. I think it would be
> > >> entirely viable to prioritise code reuse for an initial implementation
> > >> of diskless topics, and we could still have a more cloud-native design
> > >> in the future. It’s hard to predict what the community will prioritise
> > >> in the future.
> > >>
> > >> Of the three major revisions, I’m in the rev3 camp. We can support
> > >> leaderless produce requests, first writing WAL segments into object
> > >> storage, and then using the regular partition leaders to sequence the
> > >> records. The active log segment for a diskless topic will initially
> > >> contain batch coordinates rather than record batches. The batch
> > >> coordinates can be resolved from WAL segments for consumers, and also
> > >> in order to prepare log segments for uploading to tiered storage. Jun
> > >> is probably correct that we need a more frequent object merging
> > >> process than tiered storage provides. This is just the transition from
> > >> write-optimised WAL segments to read-optimised tiered segments, and
> > >> all of the object storage-based implementations of Kafka that I’m
> > >> aware of do this rearrangement. But perhaps this more frequent object
> > >> merging is a pre-GA improvement, rather than a strict requirement for
> > >> an initial implementation for early access use.
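> > >>
> > >> To make “batch coordinates” concrete, they could be as small as the
> > >> following (a sketch with invented field names, not taken from
> > >> KIP-1163):
> > >>
> > >>   // A pointer stored in the active log segment in place of the record
> > >>   // batch itself; the bytes live in a WAL object in object storage.
> > >>   public record BatchCoordinates(
> > >>       String walObjectKey,  // which WAL segment object to read
> > >>       long byteOffset,      // where the batch starts in that object
> > >>       int byteLength,       // how many bytes to range-GET
> > >>       long baseOffset,      // first Kafka offset in the batch
> > >>       int recordCount) {}   // covers baseOffset..+recordCount-1
> > >>
> > >> Resolving a consumer fetch is then a ranged GET against the WAL
> > >> object, and preparing a tiered segment is a matter of concatenating
> > >> the referenced byte ranges in offset order.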
> > >>
> > >> For zone-aligned share consumers, the share group assignor is intended
> > >> to be rack-aware. Consumers should be assigned to partitions with
> > >> leaders in their zone. The simple assignor is not rack-aware, but it
> > >> easily could be, or we could have a rack-aware assignor.
> > >>
> > >> Thanks,
> > >> Andrew
> > >>
> > >>
> > >>> On 24 Oct 2025, at 23:14, Jun Rao <[email protected]>
> > >>> wrote:
> > >>>
> > >>> Hi, Ivan,
> > >>>
> > >>> Thanks for the reply.
> > >>>
> > >>> "As I understand, you’re speaking about locally materialized
segments.
> > >> They
> > >>> will indeed consume some IOPS. See them as a cache that could
always be
> > >>> restored from the remote storage. While it’s not ideal, it's still
OK
> > to
> > >>> lose data in them due to a machine crash, for example. Because of
this,
> > >> we
> > >>> can avoid explicit flushing on local materialized segments at all
and
> > let
> > >>> the file system and page cache figure out when to flush optimally.
This
> > >>> would not eliminate the extra IOPS, but should reduce it
dramatically,
> > >>> depending on throughput for each partition. We, of course, continue
> > >>> flushing the metadata segments as before."
> > >>>
> > >>> If we have a mix of classic and diskless topics on the same broker,
> > >>> it's important that the classic topics' data is flushed to disk as
> > >>> quickly as possible. To achieve this, users typically set
> > >>> dirty_expire_centisecs in the kernel based on the number of available
> > >>> disk IOPS. Once you set this number, it applies to all dirty files,
> > >>> including the cached data in diskless topics. So, if there are more
> > >>> files actively accumulating data, the flush frequency is reduced and
> > >>> therefore the RPO gets worse for classic topics.
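> > >>>
> > >>> For concreteness: vm.dirty_expire_centisecs is expressed in
> > >>> hundredths of a second, so e.g. a value of 500 asks the kernel to
> > >>> write back any dirty page within roughly 5 seconds. Whether the disk
> > >>> can keep up depends on how many files are actively dirty: at 1K IOPS,
> > >>> roughly 5K dirty files can each be flushed once per 5-second window,
> > >>> so doubling the number of actively written files pushes the effective
> > >>> per-file flush interval, and hence the RPO, toward 10 seconds.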
> > >>>
> > >>> "We should have mentioned this explicitly, but this step, in fact,
> > >> remains
> > >>> in the form of segments offloading to tiered storage. When we
assemble
> > a
> > >>> segment and hand it over to RemoteLogManager, we’re effectively
doing
> > >>> metadata compaction: replacing a big number of pieces of metadata
about
> > >>> individual batches with a single record in __remote_log_metadata."
> > >>>
> > >>> The object merging in tier storage typically only kicks in after a few
> > >>> hours. The impact is (1) the amount of accumulated metadata is still
> > >>> quite large; (2) there are many small objects, leading to poor read
> > >>> performance. I think we need a more frequent object merging process
> > >>> than tier storage provides.
> > >>>
> > >>> Jun
> > >>>
> > >>>
> > >>> On Thu, Oct 23, 2025 at 10:12 AM Ivan Yurchenko <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Hello Jack, Jun, Luke, and all!
> > >>>>
> > >>>> Thank you for your messages.
> > >>>>
> > >>>> Let me first address some of Jun’s comments.
> > >>>>
> > >>>>> First, it degrades the durability.
> > >>>>> For each partition, now there are two files being actively written
> > >>>>> at a given point of time, one for the data and another for the
> > >>>>> metadata. Flushing each file requires a separate IO. If the disk has
> > >>>>> 1K IOPS and we have 5K partitions in a broker, currently we can
> > >>>>> afford to flush each partition every 5 seconds, achieving an RPO of
> > >>>>> 5 seconds. If we double the number of files per partition, we can
> > >>>>> only flush each partition every 10 seconds, which makes RPO twice
> > >>>>> as bad.
> > >>>>
> > >>>> As I understand, you’re speaking about locally materialized segments.
> > >>>> They will indeed consume some IOPS. See them as a cache that could
> > >>>> always be restored from the remote storage. While it’s not ideal,
> > >>>> it's still OK to lose data in them due to a machine crash, for
> > >>>> example. Because of this, we can avoid explicit flushing on local
> > >>>> materialized segments at all and let the file system and page cache
> > >>>> figure out when to flush optimally. This would not eliminate the
> > >>>> extra IOPS, but should reduce it dramatically, depending on
> > >>>> throughput for each partition. We, of course, continue flushing the
> > >>>> metadata segments as before.
> > >>>>
> > >>>> It’s worth making a note on caching. I think nobody will disagree
> > >>>> that doing direct reads from remote storage every time a batch is
> > >>>> requested by a consumer will not be practical from either a
> > >>>> performance or an economic point of view. We need a way to keep the
> > >>>> number of GET requests down. There are multiple options, for example
> > >>>> (a small sketch of option 2 follows the list):
> > >>>> 1. Rack-aware distributed in-memory caching.
> > >>>> 2. Local in-memory caching. Comes with less network chattiness and
> > >>>> works well if we have more or less stable brokers to consume from.
> > >>>> 3. Materialization of diskless logs on local disk. Way lower impact
> > >>>> on RAM and also requires stable brokers for consumption (using just
> > >>>> assigned replicas will probably work well).
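> > >>>>
> > >>>> A minimal sketch of option 2 with plain JDK types (names invented
> > >>>> here; a real cache would bound by bytes rather than entry count and
> > >>>> handle concurrent access):
> > >>>>
> > >>>>   import java.nio.ByteBuffer;
> > >>>>   import java.util.LinkedHashMap;
> > >>>>   import java.util.Map;
> > >>>>
> > >>>>   // Identifies a batch by its coordinates in object storage.
> > >>>>   record BatchKey(String objectKey, long byteOffset, int length) {}
> > >>>>
> > >>>>   // Keeps recently fetched byte ranges in memory so repeated
> > >>>>   // consumer fetches of the same batch don't each become a GET.
> > >>>>   final class BatchCache extends LinkedHashMap<BatchKey, ByteBuffer> {
> > >>>>       private final int maxEntries;
> > >>>>
> > >>>>       BatchCache(int maxEntries) {
> > >>>>           super(16, 0.75f, true);  // access order => LRU eviction
> > >>>>           this.maxEntries = maxEntries;
> > >>>>       }
> > >>>>
> > >>>>       @Override
> > >>>>       protected boolean removeEldestEntry(
> > >>>>               Map.Entry<BatchKey, ByteBuffer> eldest) {
> > >>>>           return size() > maxEntries;  // drop least recently used
> > >>>>       }
> > >>>>   }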
> > >>>>
> > >>>> Materialization is one of the possible options, but we can choose
> > >>>> another one. However, we will have this dilemma regardless of whether
> > >>>> we have an explicit coordinator or we go “coordinator-less”.
> > >>>>
> > >>>>> Second, if we ever need this
> > >>>>> metadata somewhere else, say in the WAL file manager, the consumer
> > >>>>> needs to subscribe to ev
[message truncated...]
