Hi Kamal:

I think I understand what you mean now. I’ve updated the picture in the
link (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
Could you help double-check whether we’ve reached the same understanding?
In short, the drawback of this KIP is that, during a long remote storage
outage, it will occupy more local disk. The maximum extra usage is the
redundant portion we would otherwise have saved.
Thus, after the outage is recovered, usage will return to the baseline.
Please correct me if my understanding is wrong! Thanks again.
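To make the trade-off concrete, here is a rough back-of-envelope sketch of the arithmetic being discussed (my own illustration, not code from the KIP or the PR; the ingest rate and retention values are assumed for the example):

```python
# Back-of-envelope estimate of (a) the duplicated remote storage that
# delayed upload avoids, and (b) the maximum extra local disk a long
# remote-storage outage could pin. All numbers are illustrative
# assumptions, not Kafka defaults or values from the KIP.

def duplicated_remote_gb(ingest_gb_per_day: float,
                         local_retention_days: float) -> float:
    """With eager upload, data inside the local retention window is stored
    twice; the duplicated remote copy is roughly ingest * local retention."""
    return ingest_gb_per_day * local_retention_days

def savings_ratio(local_retention_days: float,
                  remote_retention_days: float) -> float:
    """Fraction of remote storage saved by delaying upload until segments
    leave the local retention window."""
    return local_retention_days / remote_retention_days

# Example: 100 GB/day ingest, 1 day local retention, 7 days remote retention.
dup = duplicated_remote_gb(100, 1)   # duplicated copy held in remote
ratio = savings_ratio(1, 7)          # fraction of remote usage saved

# During a remote-storage outage, delayed upload can pin at most roughly
# the same duplicated amount of extra local disk until the outage ends.
print(f"duplicated: {dup} GB, savings ratio: {ratio:.3f}")
```

With the KIP's test example of 1 day local / 2 days remote, the same formula gives a savings ratio of 0.5, matching the ~50% figure quoted below.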

Regards
Jian

Kamal Chandraprakash <[email protected]> 于2025年12月2日周二 19:29写道:

> The already uploaded segments are eligible for deletion from the broker.
> So, when remote storage is down,
> then those segments can be deleted as per the local retention settings and
> new segments can occupy those spaces.
> This provides more time for the Admin to act when remote storage is down
> for a longer time.
>
> This is from a reliability perspective.
>
> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
>
> > Hi Kamal and Haiying Cai:
> >
> > You may have noticed that my Kafka clusters are set to 1 day local +
> > 3-7 days remote, while Haiying Cai's configuration is 3 hours local +
> > 3 days remote.
> >
> > Let me explain more about my configuration.
> > I try to avoid the latency of lagging consumers accessing remote storage.
> > Some applications may encounter unexpected issues, and we need to give
> > them enough time to recover. During that period, we don't want consumers
> > to access remote storage and hurt the whole Kafka cluster. So one day of
> > local retention is our expectation.
> >
> > I saw one statement in Haiying Cai's KIP-1248:
> > "Currently, when a new consumer or a fallen-off consumer requires fetching
> > messages from a while ago, and those messages are no longer present in the
> > Kafka broker's local storage, the broker must download the message from the
> > remote tiered storage and subsequently transfer the data back to the
> > consumer."
> > Extending the local retention time is how we try to avoid that issue.
> > (Here we don't consider the case where a new consumer uses the earliest
> > offset strategy; that doesn't happen often in our cases.)
> >
> > So, based on my configuration, I see one day's worth of duplicated
> > segments wasted in remote storage, and I don't use them for real-time
> > analytics or care about fast reboots or anything else. So I propose this
> > KIP as an optional topic-level feature to help us reduce the waste and
> > save money.
> >
> > Regards
> > Jian
> >
> > jian fu <[email protected]> 于2025年12月2日周二 18:42写道:
> >
> > > Hi  Kamal:
> > >
> > > Thanks for joining this discussion. Let me try to clarify my
> > > understanding of your good questions:
> > >
> > > 1. Kamal: Do you also have to update the RemoteCopy lag segments and
> > > bytes metric?
> > >     Jian: The code just delays the upload time of local segments, so it
> > > seems there is no need to change any lag segments or metrics, right?
> > >
> > > 2. Kamal: As Haiying mentioned, the segments get eventually uploaded
> > > to remote, so not sure about the benefit of this proposal. And, remote
> > > storage cost is considered low when compared to broker local disk.
> > >     Jian: The cost benefit is about the total occupied size. Take AWS
> > > S3 as an example: the tiered price is 0.02 USD for 1 GB (you can refer
> > > to https://calculator.aws/#/createCalculator/S3).
> > >   It is cheaper than local disk. As I mentioned, the money saved
> > > depends on the ratio of local vs. remote retention time. If you set a
> > > long remote retention time, the benefit is small; it is more about
> > > avoiding waste than about cost saving.
> > >   That is why I make it an optional topic-level configuration instead
> > > of a default feature.
> > >
> > > 3. Kamal: It provides some cushion during third-party object storage
> > > downtime.
> > >     Jian: I drew a picture to try to understand the logic (
> > > https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
> > > Could you help check whether my understanding is right? It seemed to
> > > me that there is no difference between them, so maybe we need to
> > > discuss this question more. The only difference may be a small
> > > temporary increase in local disk usage due to the delayed remote
> > > upload. In the original proposal I wanted to upload N-1 segments, but
> > > that seems to add little value.
> > >
> > > BTW, I want to clarify one basic rule: this feature doesn't change
> > > the default behavior, and the amount saved is not very big in all
> > > cases. It is suitable for the subset of topics with a low remote/local
> > > retention ratio, such as 7 days/1 day or 3 days/1 day.
> > > Lastly, thanks again for your time and your comments. All the
> > > questions are valid and good for us to think more about.
> > >
> > > Regards
> > > Jian
> > >
> > >
> > > Kamal Chandraprakash <[email protected]> 于2025年12月2日周二
> > > 17:41写道:
> > >
> > >> 1. Do you also have to update the RemoteCopy lag segments and bytes
> > >> metric?
> > >> 2. As Haiying mentioned, the segments get eventually uploaded to
> > >> remote so not sure about the benefit of this proposal. And, remote
> > >> storage cost is considered as low when compared to broker local-disk.
> > >> It provides some cushion during third-party object storage downtime.
> > >>
> > >>
> > >> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash <
> > >> [email protected]> wrote:
> > >>
> > >> > Hi Jian,
> > >> >
> > >> > Thanks for the KIP!
> > >> >
> > >> > When remote storage is unavailable for a few hours, then with lazy
> > >> > upload there is a risk of the broker disk getting full soon.
> > >> > The Admin has to configure the local retention configs properly.
> > >> > With eager upload, the disk utilization won't grow until the local
> > >> > retention time (the expectation is that all the passive segments
> > >> > are uploaded). And it provides some time for the Admin to take any
> > >> > action based on the situation.
> > >> >
> > >> > --
> > >> > Kamal
> > >> >
> > >> > On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev <
> > >> [email protected]>
> > >> > wrote:
> > >> >
> > >> >> Jian,
> > >> >>
> > >> >> Understood, this is an optional feature and the cost saving
> > >> >> depends on the ratio between local.retention.ms and total
> > >> >> retention.ms.
> > >> >>
> > >> >> In our setup, we have local.retention set to 3 hours and total
> > >> >> retention set to 3 days, so the saving is not going to be
> > >> >> significant.
> > >> >>
> > >> >> On 2025/12/01 05:33:11 jian fu wrote:
> > >> >> > Hi Haiying Cai,
> > >> >> >
> > >> >> > Thanks for joining the discussion for this KIP. All of your
> > concerns
> > >> are
> > >> >> > valid, and that is exactly why I introduced a topic-level
> > >> configuration
> > >> >> to
> > >> >> > make this feature optional. This means that, by default, the
> > behavior
> > >> >> > remains unchanged. Only when users are not pursuing faster broker
> > >> boot
> > >> >> time
> > >> >> > or other optimizations — and care more about cost — would they
> > enable
> > >> >> this
> > >> >> > option to some topics to save resources.
> > >> >> >
> > >> >> > Regarding the cost itself: the actual savings depend on the
> > >> >> > ratio between local retention and remote retention. In the
> > >> >> > KIP/PR, I provided a test example: if we configure 1 day of
> > >> >> > local retention and 2 days of remote retention, we can save
> > >> >> > about 50%. And realistically, I don't think anyone would boldly
> > >> >> > set local retention to a very small value (such as minutes) due
> > >> >> > to the latency concerns associated with remote storage. So in
> > >> >> > short, the feature will help reduce cost, and the amount saved
> > >> >> > simply depends on the ratio.
> > >> >> > Take my company's usage as a real example: we configure most of
> > >> >> > the topics with 1 day of local retention and 3-7 days of remote
> > >> >> > retention (3 days for topics with log/metric usage, 7 days for
> > >> >> > topics with normal business usage), and we don't care about boot
> > >> >> > speed and such. This KIP allows us to save 1/7 to 1/3 of the
> > >> >> > total disk usage for remote storage.
> > >> >> >
> > >> >> > Anyway, this is just an optional topic-level feature which
> > >> >> > doesn't negate the benefits of the current design. Thanks again
> > >> >> > for the discussion. I can update the KIP to better clarify
> > >> >> > scenarios where this optional feature is not suitable.
> > >> >> > Currently, I only listed real-time analytics as the negative
> > >> >> > example.
> > >> >> >
> > >> >> > Welcome further discussion to help make this KIP more complete.
> > >> Thanks!
> > >> >> >
> > >> >> > Regards,
> > >> >> > Jian
> > >> >> >
> > >> >> > Haiying Cai via dev <[email protected]> 于2025年12月1日周一
> > 12:40写道:
> > >> >> >
> > >> >> > > Jian,
> > >> >> > >
> > >> >> > > Thanks for the contribution.  But I feel uploading the local
> > >> >> > > segment file to remote storage ASAP is advantageous in several
> > >> >> > > scenarios:
> > >> >> > >
> > >> >> > > 1. Enables fast bootstrapping of a new broker.  A new broker
> > >> >> > > doesn’t have to replicate all the data from the leader broker;
> > >> >> > > it only needs to replicate the data from the tail of the
> > >> >> > > remote log segment to the current end of the topic (LSO),
> > >> >> > > since all the other data is in the remote tiered storage and
> > >> >> > > can be downloaded lazily later.  This is what KIP-1023 is
> > >> >> > > trying to solve;
> > >> >> > > 2. Although nobody has proposed a KIP to allow a consumer
> > >> >> > > client to read from the remote tiered storage directly, this
> > >> >> > > would help a fall-behind consumer do catch-up reads or perform
> > >> >> > > backfill.  This path allows the consumer backfill to finish
> > >> >> > > without polluting the broker’s page cache.  The earlier the
> > >> >> > > data is on the remote tiered storage, the more advantageous it
> > >> >> > > is for the client.
> > >> >> > >
> > >> >> > > I think in your proposal you are delaying uploading the
> > >> >> > > segment, but the file will still be uploaded at a later time.
> > >> >> > > I guess this can save a few hours' storage cost for that file
> > >> >> > > in the remote storage; not sure whether that is a significant
> > >> >> > > saving (if the file needs to stay in remote tiered storage for
> > >> >> > > several days or weeks due to the retention policy).
> > >> >> > >
> > >> >> > > On 2025/11/19 13:29:11 jian fu wrote:
> > >> >> > > > Hi everyone, I'd like to start a discussion on KIP-1241;
> > >> >> > > > the goal is to reduce remote storage usage. KIP:
> > >> >> > > >
> > >> >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
> > >> >> > > >
> > >> >> > > > The Draft PR: https://github.com/apache/kafka/pull/20913
> > >> >> > > > Problem: Currently, Kafka's tiered storage implementation
> > >> >> > > > uploads all non-active local log segments to remote storage
> > >> >> > > > immediately, even when they are still within the local
> > >> >> > > > retention period.
> > >> >> > > > This results in redundant storage of the same data in both
> > >> >> > > > local and remote tiers.
> > >> >> > > >
> > >> >> > > > When there is no requirement for real-time analytics or
> > >> >> > > > immediate consumption from remote storage, this has the
> > >> >> > > > following drawbacks:
> > >> >> > > >
> > >> >> > > > 1. Wastes storage capacity and cost: the same data is stored
> > >> >> > > > twice during the local retention window
> > >> >> > > > 2. Provides no immediate benefit: during the local retention
> > >> >> > > > period, reads prioritize local data, making the remote copy
> > >> >> > > > unnecessary
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > So this KIP proposes reducing tiered storage redundancy
> > >> >> > > > with delayed upload.
> > >> >> > > > You can check the test result example here directly:
> > >> >> > > > https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
> > >> >> > > > Looking forward to your feedback! Best regards, Jian
> > >> >> > > >
> > >> >> >
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > >
> >
>
