Thanks Colin! As one of the co-authors, here are the team responses:

> I think there's a bit of confusion in the motivation and naming here. As
Jun said, what's being proposed here is not truly "diskless" -- we're still
storing a fair amount of metadata on local disks.

We propose to address the naming question after the other technical
questions have been resolved, so that there is a clear understanding of what
the name applies to (in case there are significant changes).

> The proposal talks about "Unification/Relationship with Tiered Storage:
Identifying a long-term vision for Diskless and Tiered Storage plugins" as
"future work." But it seems like when we're adding a new feature, we should
consider how it interacts with existing features before we add it, not
after it's already in place.

We are reconsidering this and working on a proposal to integrate with
Tiered Storage as part of the KIP.
We’ll share it here once the KIP is updated.

> As it stands currently, the big advantage of KIP-1150 over the
traditional tiered storage is that with KIP-1150, you don't have to send
most of your data through normal Kafka replication. This, in turn, is
mainly about saving costs on clouds where replication is expensive.

We are happy to see another proposal tackling similar challenges from a
different perspective.
Both proposals aim to avoid sending data through normal Kafka replication.
The main difference, in our understanding, is that KIP-1150 proposes a
leaderless design where all client cross-AZ costs are eliminated, not only
the Kafka replication costs.
Also, KIP-1150 proposes to use generally available object storage options
such as regional S3, while KIP-1176 currently depends on S3 Express, which
is not broadly available.

> A. What kind of latencies should we expect here? It seems like we're both
buffering lots of produce requests, and waiting until they're written to s3.

Append latency is broadly composed of:
- Buffering: up to 250ms or 8MiB (both configurable)
- Upload to Remote Storage: P99 ~200-400ms, P50 ~100ms
- Batch Coordinates Commit: P99 ~20-50ms, P50 ~10ms (depending on the
number of batches)

We are aiming for a Produce request latency of P50 ~500ms and P99 ~1-2s.
We will add a note about this to KIP-1163.
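As a rough back-of-envelope illustration (our own estimate based on the P50
figures above, not a measurement from the KIPs):

  ~250ms worst-case buffering + ~100ms upload (P50) + ~10ms commit (P50)
  ≈ ~360ms, which leaves headroom for request handling and network
  overhead within the ~500ms P50 target.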

> B. Could we do something similar with KIP-1176 by not ack'ing the
ProduceRequest until the tiering had caught up to what we produced? This
will have higher latency, but maybe not higher than KIP-1150 (see point A).
If we could do that then maybe the cost advantage of KIP-1150 disappears,
since I could put all the replicas of my topic in one AZ, and ensure
durability by waiting for s3.

KIP-1176, as it’s proposed now, requires 3 replicas to support durability
across 3 AZs. Its authors have mentioned the option of acknowledging data
right after it’s written to S3, without waiting for normal replication to
the other zones. This is an interesting alternative to reduce latency,
though there’s still a chance that data becomes unavailable if the AZ-based
remote storage becomes unreachable. We’ve suggested that this option not be
called acks=all but something else.
KIP-1150 relies on regional S3, which increases availability and
durability, and acknowledges producers only after data is written to remote
storage and offsets are committed on the Batch Coordinator.

> Another piece of feedback I would give is that I do not think the batch
coordinator should be pluggable.

Similar to KIP-405, we propose the component to be pluggable but to include
a default implementation, which will be the only one maintained by the
project.
Having a plugin interface doesn’t necessarily make the system harder to
evolve, as the interface can be marked `Evolving` while the KIP-1164
implementation matures, leaving room for improvement.
By providing a default topic-based batch coordinator, the benefits will
apply to most Kafka deployments; but that implementation will be bound to
Kafka's own constraints (e.g. stretching at most within a single region).
Diskless has the potential to unlock deployment models where batch metadata
needs to be consistent across regions or globally, backed by external
systems; those would need a separate implementation.
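To make the intent concrete, here is a rough sketch of the shape an
`Evolving` plugin contract could take. To be clear, this is not the
interface proposed in KIP-1164; every name below is hypothetical and only
illustrates the idea:

  import java.util.List;

  import org.apache.kafka.common.TopicIdPartition;
  import org.apache.kafka.common.annotation.InterfaceStability;

  @InterfaceStability.Evolving
  public interface BatchCoordinator extends AutoCloseable {

      // Hypothetical value object: where a committed batch lives in object
      // storage (object key plus byte range) for a given topic-partition.
      record BatchCoordinates(TopicIdPartition partition,
                              String objectKey,
                              long byteOffset,
                              int byteLength,
                              int recordCount) { }

      // Durably record the coordinates of batches already uploaded to
      // object storage and return the base offset assigned to each batch.
      List<Long> commit(List<BatchCoordinates> coordinates);

      // Resolve the coordinates needed to serve a fetch starting at the
      // given offset for the given topic-partition.
      List<BatchCoordinates> find(TopicIdPartition partition,
                                  long fetchOffset,
                                  int maxBytes);
  }

A topic-based default implementation of such a contract would ship with
Kafka, while externally consistent (e.g. multi-region) coordinators could
be built against the same interface.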

> For "Compatibility, Deprecation, and Migration Plan," we just have some
text saying that this feature didn't exist before, and now it will. But
this isn't very helpful.

Fair point, though this should be covered within the sub-KIPs. We will look
into expanding those.

Thanks,
Jorge.


On Wed, 14 May 2025 at 00:44, Colin McCabe <cmcc...@apache.org> wrote:

> Hi Josep,
>
> Thanks for the KIP.
>
> I think there's a bit of confusion in the motivation and naming here. As
> Jun said, what's being proposed here is not truly "diskless" -- we're still
> storing a fair amount of metadata on local disks.
>
> The proposal talks about "Unification/Relationship with Tiered Storage:
> Identifying a long-term vision for Diskless and Tiered Storage plugins" as
> "future work." But it seems like when we're adding a new feature, we should
> consider how it interacts with existing features before we add it, not
> after it's already in place.
>
> To that end, it's useful to compare this KIP against KIP-1176: Tiered
> Storage for Active Log Segment. In their current forms, both KIP-1176 and
> KIP-1150 require small disks on each broker. Traditional Kafka tiered
> storage essentially lets us treat s3 (or other blobstore) as cold storage
> for older data. KIP-1176 is essentially a refinement of that model that
> allows us to tier the active log segments as well.
>
> As it stands currently, the big advantage of KIP-1150 over the traditional
> tiered storage is that with KIP-1150, you don't have to send most of your
> data through normal Kafka replication. This, in turn, is mainly about
> saving costs on clouds where replication is expensive.
>
> When I read KIP-1163, I see the following:
>
> > 1. Producers send Produce requests to any broker.
> > 2. The broker accumulates Produce requests in a buffer until exceeding
> some size or time limit.
> > 3. When enough data accumulates or the timeout elapses, the Broker
> creates a shared log segment and batch
> >  coordinates for all of the buffered batches.
> > 4. The shared log segment is uploaded to object storage and is written
> durably.
> > 5. The broker commits the batch coordinates with the Batch Coordinator
> (described in details in KIP-1164).
> > 6. The Batch Coordinator assigns offsets to the written batches,
> persists the batch coordinates, and responds
> >  to the Broker.
> > 7. The broker sends responses to all Produce requests that are
> associated with the committed object.
>
> To me this raises a few questions:
>
> A. What kind of latencies should we expect here? It seems like we're both
> buffering lots of produce requests, and waiting until they're written to s3.
>
> B. Could we do something similar with KIP-1176 by not ack'ing the
> ProduceRequest until the tiering had caught up to what we produced? This
> will have higher latency, but maybe not higher than KIP-1150 (see point A).
> If we could do that then maybe the cost advantage of KIP-1150 disappears,
> since I could put all the replicas of my topic in one AZ, and ensure
> durability by waiting for s3.
>
> Another piece of feedback I would give is that I do not think the batch
> coordinator should be pluggable. Since this is a central part of the
> system, we should try to focus our efforts on designing a single good one,
> rather than having lots of pluggable ones. Making this pluggable also will
> make it difficult to evolve the system in the future. We should present a
> compelling use-case for pluggability before introducing it. (In the case
> of supporting all the different blobstores, the need for pluggability is
> obvious, of course.)
>
> For "Compatibility, Deprecation, and Migration Plan," we just have some
> text saying that this feature didn't exist before, and now it will. But
> this isn't very helpful. Instead, we should try to spell out what parts of
> the system will come with compatibility guarantees. For example, will the
> format in which we write data to s3 (or other blobstore) be stable and
> documented, so that 3rd party tools can work with it? Or will we keep it
> internal and unstable?
>
> best,
> Colin
>
>
> On Wed, Apr 16, 2025, at 04:58, Josep Prat wrote:
> > Hi Kafka Devs!
> >
> > We want to start a new KIP discussion about introducing a new type of
> > topics that would make use of Object Storage as the primary source of
> > storage. However, as this KIP is big we decided to split it into multiple
> > related KIPs.
> > We have the motivational KIP-1150 (
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> )
> > that aims to discuss if Apache Kafka should aim to have this type of
> > feature at all. This KIP doesn't go onto details on how to implement it.
> > This follows the same approach used when we discussed KRaft.
> >
> > But as we know that it is sometimes really hard to discuss on that meta
> > level, we also created several sub-kips (linked in KIP-1150) that offer
> an
> > implementation of this feature.
> >
> > We kindly ask you to use the proper DISCUSS threads for each type of
> > concern and keep this one to discuss whether Apache Kafka wants to have
> > this feature or not.
> >
> > Thanks in advance on behalf of all the authors of this KIP.
> >
> > ------------------
> > Josep Prat
> > Open Source Engineering Director, Aiven
> > josep.p...@aiven.io   |   +491715557497 | aiven.io
> > Aiven Deutschland GmbH
> > Alexanderufer 3-7, 10117 Berlin
> > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > Anna Richardson, Kenneth Chen
> > Amtsgericht Charlottenburg, HRB 209739 B
>
