Hello!

It is great to see another proposal on the same topic, but optimising for
different scenarios, so thanks a lot for the effort put into this!

I have a few questions and statements in no particular order.

If you use acks=-1 (acks=all), then an acknowledgement should be sent to
the producer only once the records have been persisted in replicated
object storage (S3), or persisted in non-replicated object storage (S3E1AZ)
and downloaded by the followers. If you do not do this, then you do not
cover the following two failure scenarios, which Kafka does cover today:

1. Your leader persists records on disk. Your followers fetch the metadata
for these records. The high watermark on the leader advances. The leader
sends acknowledgement to the producer. The records are not yet put in
object storage. The leader crashes irrecoverably before the records are
uploaded.

2. Your leader persists records on disk. Your followers fetch the metadata
for these records. The high watermark on the leader advances. The leader
sends acknowledgement to the producer. The records are put in
non-replicated object storage, but not downloaded by followers. The
non-replicated object storage experiences prolonged unavailability. The
leader crashes irrecoverably.

In both of these scenarios you risk either data loss or data unavailability
if a single replica goes out of commission. As such, this breaks the
current definition of acks=-1 (acks=all) to the best of my knowledge. I am
happy to discuss this further if you think this is not the case.
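
For reference, the contract I am appealing to is the standard producer-side
one sketched below (plain Kafka client API, nothing KIP-specific; the topic
and bootstrap address are made up). With acks=all, the send() future should
only complete successfully once the durability conditions described above
hold on the broker side.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  StringSerializer.class.getName());
        // acks=all: the leader must not acknowledge until the record is
        // durable according to the replication (or, under this KIP, the
        // upload/download) guarantees discussed above.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The future completes only after the broker acknowledges.
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
    }
}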

S3E1AZ only resides in 1 availability zone. This poses the following
questions:
a) Will you have 1 bucket per availability zone, assuming a 3-broker
cluster where each broker is in a separate availability zone? (I sketch
what I mean right after these questions.)
b) If not, then have you run a test on the network penalty in terms of
latency for the 2 brokers that are not in the same availability zone as the
bucket but are leaders for their respective partitions? Here I am
interested in what 2/3 of any cluster will experience.
c) On a quick search it isn't clear whether S3E1AZ incurs cross-AZ
networking data charges (again, in the case where there is only 1 bucket
for the whole cluster). This might be my fault, but from the table at the
end of the KIP it isn't super obvious to me whether the transfer cost
includes these network charges. Have you run a test to see whether the
pricing still makes sense? If you have, could you share these numbers in
the KIP?
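
To make question (a) concrete, here is a rough sketch of the per-AZ bucket
mapping I have in mind (the helper class and bucket names are purely
hypothetical, not something taken from the KIP): each broker would upload
only to the S3 Express One Zone bucket co-located with its own availability
zone.

import java.util.Map;

// Hypothetical helper, not from the KIP: pick the S3 Express One Zone bucket
// that lives in the same availability zone as the broker, so uploads never
// leave the AZ.
public final class WalBucketSelector {
    private static final Map<String, String> BUCKET_BY_AZ = Map.of(
            "use1-az1", "kafka-wal--use1-az1--x-s3",
            "use1-az2", "kafka-wal--use1-az2--x-s3",
            "use1-az3", "kafka-wal--use1-az3--x-s3");

    public static String bucketFor(String brokerAzId) {
        String bucket = BUCKET_BY_AZ.get(brokerAzId);
        if (bucket == null) {
            throw new IllegalArgumentException(
                    "No WAL bucket configured for AZ " + brokerAzId);
        }
        return bucket;
    }
}

The flip side of such a layout is that a follower in another AZ then reads
from a bucket outside its own zone, which is exactly what questions (b) and
(c) are about.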

As far as I understand, this will work in conjunction with Tiered Storage
as it works today. Am I correct in my reading of the KIP? If I am, then how
you store data in active segments seems to differ from how TS stores data
in closed segments: in your proposal you put multiple partitions in the
same blob. What will move this data back to the old format used by TS, and
how?
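
To illustrate what I am asking, here is a very rough sketch with entirely
made-up types (nothing below is from the KIP): a combined WAL blob holding
batches from several partitions would presumably need to be regrouped per
topic-partition before the existing segment-per-partition layout used by TS
can take over.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical shape of one entry inside a combined WAL blob.
record WalEntry(String topic, int partition, long baseOffset, byte[] batch) {}

class WalBlobSplitter {
    // Regroup the entries of a combined blob by topic-partition, which is
    // roughly the unit KIP-405 tiered storage works with (one segment per
    // topic-partition).
    static Map<String, List<WalEntry>> splitByPartition(List<WalEntry> entries) {
        return entries.stream()
                .collect(Collectors.groupingBy(e -> e.topic() + "-" + e.partition()));
    }
}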

How will you handle compaction?
How will you handle indexes?
How will you handle transactions?

Once again, this is quite exciting, so thanks for the contribution!

Best,
Christo

On Thu, 1 May 2025 at 19:01, Henry Haiying Cai
<haiying_...@yahoo.com.invalid> wrote:

>  Luke,
> Thanks for your comments; see my answers inline below.
>     On Thursday, May 1, 2025 at 03:20:54 AM PDT, Luke Chen <
> show...@gmail.com> wrote:
>
>  Hi Henry,
>
> This is a very interesting proposal!
> I love the idea of minimizing the code change so that this can be
> delivered quickly.
> Thanks for proposing this!
>
> Some questions:
> 1. In this KIP, we add one more tier of storage. That is: local disk ->
> fast object store -> slow object store.
> Why can't we allow users to replace the local disk with the fast object
> store directly? Any consideration on this?
> If we don't have the local disk, the follower fetch will be much simpler
> since there is no need to download from the fast object store. Is my
> understanding correct?
> HC> The fast object storage is not as fast as local disk: the data latency
> on fast object storage is going to be around 10 ms for big data packets,
> while the local disk append is fast since we only need to append the
> records into the page cache of the local file (the flush from page cache
> to disk is done asynchronously without affecting the main request/reply
> cycle between the producer and the leader broker).  This is actually the
> major difference between this KIP and KIP-1150: although KIP-1150 can
> completely remove the local disk, they are going to have a longer latency
> (their main use case is for customers who can tolerate 200 ms latency) and
> they need to start building their own memory management and caching
> strategy since they are not using the page cache anymore.  Our KIP has no
> latency change (compared to current Kafka) on the acks=1 path, which I
> believe is still the operating mode for many companies' logging pipelines.
>
> 2. Will the WALmetadata be deleted after the data in fast object storage is
> deleted?
> I'm a little worried about the metadata size in the WALmetadata. I guess
> the __remote_log_metadata topic is stored in local disk only, right?
> HC> Currently we are reusing the classes and constructs from KIP-405, e.g.
> the __remote_log_metadata topic and ConsumerManager and ProducerManager.
> As you pointed out, the metadata for segments from active log segments is
> going to be big; our vision is to create a separate metadata topic for
> active log segments so that we can set a shorter retention for this topic
> and remove the segment metadata faster, but we would need to refactor the
> code in ConsumerManager and ProducerManager to work with a 2nd metadata
> topic.
>
> 3. In this KIP, we assume the fast object store is different from the slow
> object store.
> Is it possible we allow users to use the same one?
> Let's say we set both fast/slow object store = S3 (some use cases don't
> care too much about latency); if we offload the active log segment onto
> the fast object store (S3), can we avoid offloading the segment to the
> slow object store again after the log segment is rolled?
> I'm wondering if it's possible to learn (borrow) some ideas from KIP-1150.
> This way, we can achieve a similar goal, since we accumulate (combine)
> data from multiple partitions and upload it to S3 to save cost.
>
> HC> Of course people can choose to use S3 for both fast and slow object
> storage.  They can have the same class implement both RemoteStorageManager
> and RemoteWalStorageManager; we proposed RemoteWalStorageManager as a
> separate interface to give people different implementation choices.
> I think KIP-1176 (this one) and KIP-1150 can combine some ideas or
> implementations.  We mainly focus on cutting AZ transfer cost while
> maintaining the same performance characteristics (such as latency) and
> doing a smaller evolution of the current Kafka code base.  KIP-1150 is a
> much more ambitious effort with a complete revamp of the Kafka storage and
> memory management system.
> Thank you.
> Luke
>
> On Thu, May 1, 2025 at 1:45 PM Henry Haiying Cai
> <haiying_...@yahoo.com.invalid> wrote:
>
> > Link to the KIP:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1176%3A+Tiered+Storage+for+Active+Log+Segment
> > Motivation
> > In KIP-405, the community has proposed and implemented tiered storage
> > for old Kafka log segment files: when a log segment is older than
> > local.retention.ms, it becomes eligible to be uploaded to the cloud's
> > object storage and removed from the local storage, thus reducing local
> > storage cost.  KIP-405 only uploads older log segments but not the most
> > recent active log segments (write-ahead logs). Thus in a typical 3-way
> > replicated Kafka cluster, the 2 follower brokers would still need to
> > replicate the active log segments from the leader broker. It is common
> > practice to set up the 3 brokers in three different AZs to improve the
> > high availability of the cluster. This causes the replication between
> > leader and follower brokers to cross AZs, which is a significant cost
> > (various studies show the cross-AZ transfer cost typically comprises
> > 50%-60% of the total cluster cost). Since all the active log segments
> > are physically present on three Kafka brokers, they still account for
> > significant resource usage on the brokers. The state of the broker is
> > still quite big during node replacement, leading to longer node
> > replacement time. KIP-1150 recently proposed diskless Kafka topics, but
> > it leads to increased latency and a significant redesign. In comparison,
> > this proposed KIP maintains identical performance for the acks=1
> > producer path, minimizes design changes to Kafka, and still slashes cost
> > by an estimated 43%.
> >
>
