Re: Re: [DISCUSS] KIP-1241: Reduce tiered storage redundancy with delayed upload

jian fu Wed, 07 Jan 2026 04:48:58 -0800

Hi Luke:

Thanks for your comments, and sorry for the delayed response.
Your understanding is correct on both points.


"However in some cases, users are expecting consumer read with low
latency as much as possible": that is exactly my case. I need to keep at
least one day of data local, so that reads stay low-latency while the
business handles consumer lag or other issues.
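
To put rough numbers on the trade-off for the retention setups mentioned in this thread, here is a minimal Python sketch. It assumes a steady write rate, so that stored bytes are proportional to retention time, and that with delayed upload a segment reaches remote storage only once it leaves local retention. The function name is made up for illustration; it is not part of the KIP or the PR.

```python
from fractions import Fraction


def remote_saving_fraction(local_retention_hours: int,
                           total_retention_hours: int) -> Fraction:
    """Share of remote storage saved by delaying upload until local retention ends.

    Assumption: steady write rate, so bytes stored scale with retention time.
    """
    if not 0 < local_retention_hours <= total_retention_hours:
        raise ValueError("expected 0 < local retention <= total retention")
    return Fraction(local_retention_hours, total_retention_hours)


# Retention pairs taken from the thread: (local hours, total hours).
for local_h, total_h in [(24, 48), (24, 72), (24, 168), (3, 72)]:
    print(local_h, total_h, remote_saving_fraction(local_h, total_h))
```

Under these assumptions, 1 day local with 2 days total gives 1/2, 1 day local with 3 to 7 days total gives 1/3 to 1/7, and 3 hours local with 3 days total gives only 1/24, which matches why the feature is meant as a per-topic option.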


Regarding your comments:


1. remote.log.keep.latest  or remote.log.latest.enable?

It is remote.log.latest.enable. The code is already right; let me correct
the typo in the KIP content.

2. About the configuration doc: Determines whether to upload all segments
to remote storage including the latest ones within local retention.
Why do we allow to upload the latest log segment to remote storage? The
latest log segment is the active log segment, right?

Sorry, that is my mistake. The graph in the KIP is right, but the
configuration doc is wrong. It should be:
"Determines whether to upload all non-active segments to remote storage,
including those still within local retention."
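
To spell out that wording, here is a minimal Python sketch of the rule as described above. It is not the code in the PR: the Segment class, the function name and the flag parameter are hypothetical, the flag is assumed to behave like remote.log.latest.enable with True meaning today's eager upload, and only time-based local retention is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a log segment; not Kafka's actual class."""
    base_offset: int
    is_active: bool
    age_ms: int  # time since the segment's newest record was written


def eligible_for_upload(segment: Segment,
                        upload_within_local_retention: bool,
                        local_retention_ms: int) -> bool:
    """Decide whether a segment may be copied to remote storage.

    upload_within_local_retention models the documented meaning of
    remote.log.latest.enable (assumed: True keeps the eager upload).
    """
    if segment.is_active:
        # The active segment is never uploaded, with or without the flag.
        return False
    if upload_within_local_retention:
        # Eager upload: every non-active segment is eligible right away.
        return True
    # Delayed upload: wait until the segment has left local retention.
    return segment.age_ms >= local_retention_ms


ONE_DAY_MS = 24 * 60 * 60 * 1000
segments = [
    Segment(base_offset=0, is_active=False, age_ms=2 * ONE_DAY_MS),
    Segment(base_offset=100, is_active=False, age_ms=ONE_DAY_MS // 2),
    Segment(base_offset=200, is_active=True, age_ms=0),
]
for flag in (True, False):
    offsets = [s.base_offset for s in segments
               if eligible_for_upload(s, flag, ONE_DAY_MS)]
    print(flag, offsets)
```

In this sketch the active segment at offset 200 is excluded in both modes, and the half-day-old segment at offset 100 is eligible only when the flag is set.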


I have updated the KIP content and the related code.

Thanks a lot for your review. Could you please take another look?

Regards
Jian


On Tue, Jan 6, 2026 at 5:50 PM Luke Chen <[email protected]> wrote:

> Hi Jian,
>
> Thanks for the KIP!
>
> So, your goal is
> 1. allow consumers that are reading the hot data to still read from
> local storage.
> 2. try to avoid duplicated data in local and remote as much as
> possible.
> Is my understanding correct?
>
> Currently, tiered storage keeps data local for the local retention
> time/size because we don't want consumers that read the hot data to see
> high latency. During this period, the duplication in local and remote
> storage is indeed a waste of cost, although I also agree the cost should
> not be that huge, because usually local retention is not set too
> high. However in some cases, users are expecting consumer read with low
> latency as much as possible, so the local retention is set to a high value
> and only "very cold" data is expected to be stored in the remote storage.
> In this case, this KIP should be helpful.
>
>
> Comments:
> 1. remote.log.keep.latest  or remote.log.latest.enable?
> 2. About the configuration doc: Determines whether to upload all segments
> to remote storage including the latest ones within local retention.
> Why do we allow to upload the latest log segment to remote storage? The
> latest log segment is the active log segment, right?
>
>
> Thank you,
> Luke
>
>
>
>
> On Sun, Jan 4, 2026 at 10:19 PM jian fu <[email protected]> wrote:
>
> > Hi All:
> >
> > Happy New Year! Bumping this thread again for more discussion
> > before the vote starts.
> > Thanks a lot!
> >
> > Regards
> > Jian
> >
> > On Mon, Dec 15, 2025 at 8:00 PM jian fu <[email protected]> wrote:
> >
> > > Hi All:
> > >
> > > Bumping this thread for more discussion. I’d really appreciate more
> > > suggestions on this optional feature for tiered storage. Thanks a lot !
> > >
> > > Regards
> > >
> > > Jian
> > >
> > > On Thu, Dec 4, 2025 at 9:54 PM jian fu <[email protected]> wrote:
> > >
> > >> Hi All:
> > >>
> > >> I updated the KIP content based on the discussion with Kamal and Haiying:
> > >> 1. Explicitly emphasized that this is a topic-level optional feature
> > >> intended for users who prioritize cost.
> > >> 2. Added the cost-saving calculation example.
> > >> 3. Added details about the operational drawback of this feature: extra
> > >> local disk is needed during a long remote storage outage.
> > >> 4. Added the scenarios where enabling the feature is not very suitable
> > >> or beneficial, such as topics whose remote:local retention ratio is
> > >> very large.
> > >>
> > >> Thanks again for joining the discussion.
> > >>
> > >> Regards
> > >> Jian
> > >>
> > >> On Tue, Dec 2, 2025 at 8:27 PM jian fu <[email protected]> wrote:
> > >>
> > >>> Hi Kamal:
> > >>>
> > >>> I think I understand what you mean now. I’ve updated the picture in
> > >>> the link
> > >>> (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
> > >>> Could you help double-check whether we’ve reached the same
> > >>> understanding?
> > >>> In short, the drawback of this KIP is that during a long remote
> > >>> storage outage it will occupy more local disk. The maximum extra usage
> > >>> equals the redundant part we are saving.
> > >>> Once the outage is over, disk usage returns to where it started.
> > >>> Please correct me if my understanding is wrong! Thanks again.
> > >>>
> > >>> Regards
> > >>> Jian
> > >>>
> > >>> On Tue, Dec 2, 2025 at 7:29 PM Kamal Chandraprakash <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> The already uploaded segments are eligible for deletion from the
> > >>>> broker. So, when remote storage is down, those segments can be
> > >>>> deleted as per the local retention settings and new segments can
> > >>>> occupy those spaces.
> > >>>> This provides more time for the Admin to act when remote storage is
> > >>>> down for a longer time.
> > >>>>
> > >>>> This is from a reliability perspective.
> > >>>>
> > >>>> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
> > >>>>
> > >>>> > Hi Kamal and Haiying Cai:
> > >>>> >
> > >>>> > Maybe you noticed that my Kafka clusters are set to 1 day local +
> > >>>> > 3 to 7 days remote, while Haiying Cai's configuration is 3 hours
> > >>>> > local + 3 days remote.
> > >>>> >
> > >>>> > Let me explain my configuration in more detail.
> > >>>> > I try to keep delayed consumers from paying the latency of remote
> > >>>> > reads. Some applications may hit unexpected issues, and we need to
> > >>>> > give them enough time to handle those. During that period, we don't
> > >>>> > want consumers reading from remote storage and hurting the whole
> > >>>> > Kafka cluster. So one day is what we expect.
> > >>>> >
> > >>>> > I saw this statement in Haiying Cai's KIP-1248:
> > >>>> > "Currently, when a new consumer or a fallen-off consumer requires
> > >>>> > fetching messages from a while ago, and those messages are no longer
> > >>>> > present in the Kafka broker's local storage, the broker must download
> > >>>> > the message from the remote tiered storage and subsequently transfer
> > >>>> > the data back to the consumer."
> > >>>> > Extending the local retention time is how we try to avoid that issue.
> > >>>> > (Here we don't consider a new consumer that uses the earliest
> > >>>> > strategy; that does not happen often in our case.)
> > >>>> >
> > >>>> > So, with my configuration, one day's worth of segments is duplicated
> > >>>> > and wasted in remote storage. I also don't use the remote copies for
> > >>>> > real-time analytics, and I don't care about fast reboot or similar
> > >>>> > benefits. That is why I propose this KIP: a topic-level optional
> > >>>> > feature that helps us reduce waste and save money.
> > >>>> >
> > >>>> > Regards
> > >>>> > Jian
> > >>>> >
> > >>>> > On Tue, Dec 2, 2025 at 6:42 PM jian fu <[email protected]> wrote:
> > >>>> >
> > >>>> > > Hi  Kamal:
> > >>>> > >
> > >>>> > > Thanks for joining this discussion. Let me try to clarify my
> > >>>> > > understanding of your good questions:
> > >>>> > >
> > >>>> > > 1. Kamal: Do you also have to update the RemoteCopy lag segments
> > >>>> > > and bytes metric?
> > >>>> > >    Jian: The code just delays the upload time for local segments,
> > >>>> > > so it seems there is no need to change the lag segments or bytes
> > >>>> > > metrics, right?
> > >>>> > >
> > >>>> > > 2. Kamal: As Haiying mentioned, the segments get eventually
> > >>>> > > uploaded to remote so not sure about the benefit of this proposal.
> > >>>> > > And, remote storage cost is considered as low when compared to
> > >>>> > > broker local-disk.
> > >>>> > >    Jian: The cost benefit is about the total size occupied. Take
> > >>>> > > AWS S3 as an example: the tiered price for 1 GB is 0.02 USD (you
> > >>>> > > can refer to https://calculator.aws/#/createCalculator/S3).
> > >>>> > >   It is cheaper than local disk. So, as I mentioned, the money
> > >>>> > > saved depends on the ratio of local to remote retention time. If
> > >>>> > > you set a long remote retention time, the benefit is small; it is
> > >>>> > > more about avoiding waste than about saving cost.
> > >>>> > >   So I made it a topic-level optional configuration instead of a
> > >>>> > > default feature.
> > >>>> > >
> > >>>> > > 3. Kamal: It provides some cushion during third-party object
> > >>>> > > storage downtime.
> > >>>> > >    Jian: I drew a picture to try to understand the logic
> > >>>> > > (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
> > >>>> > > Could you check whether my understanding is right? It seemed to me
> > >>>> > > that there is no difference between them, so maybe we need to
> > >>>> > > discuss this question more. The only difference may be a small,
> > >>>> > > temporary increase in local disk usage because of the delayed
> > >>>> > > upload. That is why, in the original proposal, I wanted to upload
> > >>>> > > N-1 segments. But the value of that seems small.
> > >>>> > >
> > >>>> > > BTW, I want to clarify one basic rule: this feature does not
> > >>>> > > change the default behavior, and the amount saved is not large in
> > >>>> > > every case. It suits the subset of topics that set a low
> > >>>> > > remote/local ratio, such as 7 days/1 day or 3 days/1 day.
> > >>>> > > Finally, thanks again for your time and your comments. All the
> > >>>> > > questions are valid and help us think more about it.
> > >>>> > >
> > >>>> > > Regards
> > >>>> > > Jian
> > >>>> > >
> > >>>> > >
> > >>>> > > On Tue, Dec 2, 2025 at 5:41 PM Kamal Chandraprakash <[email protected]> wrote:
> > >>>> > >
> > >>>> > >> 1. Do you also have to update the RemoteCopy lag segments and
> > >>>> > >> bytes metric?
> > >>>> > >> 2. As Haiying mentioned, the segments get eventually uploaded to
> > >>>> > >> remote so not sure about the benefit of this proposal. And, remote
> > >>>> > >> storage cost is considered as low when compared to broker
> > >>>> > >> local-disk.
> > >>>> > >> It provides some cushion during third-party object storage
> > >>>> > >> downtime.
> > >>>> > >>
> > >>>> > >>
> > >>>> > >> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash <
> > >>>> > >> [email protected]> wrote:
> > >>>> > >>
> > >>>> > >> > Hi Jian,
> > >>>> > >> >
> > >>>> > >> > Thanks for the KIP!
> > >>>> > >> >
> > >>>> > >> > When remote storage is unavailable for a few hrs, then with lazy
> > >>>> > >> > upload there is a risk of the broker disk getting full soon.
> > >>>> > >> > The Admin has to configure the local retention configs properly.
> > >>>> > >> > With eager upload, the disk utilization won't grow until the local
> > >>>> > >> > retention time (expectation is that all the passive segments are
> > >>>> > >> > uploaded). And, provides some time for the Admin to take any
> > >>>> > >> > action based on the situation.
> > >>>> > >> >
> > >>>> > >> > --
> > >>>> > >> > Kamal
> > >>>> > >> >
> > >>>> > >> > On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev <[email protected]> wrote:
> > >>>> > >> >
> > >>>> > >> >> Jian,
> > >>>> > >> >>
> > >>>> > >> >> Understood, this is an optional feature and the cost saving
> > >>>> > >> >> depends on the ratio between local.retention.ms and total
> > >>>> > >> >> retention.ms.
> > >>>> > >> >>
> > >>>> > >> >> In our setup, we have local.retention set to 3 hours and total
> > >>>> > >> >> retention set to 3 days, so the saving is not going to be
> > >>>> > >> >> significant.
> > >>>> > >> >>
> > >>>> > >> >> On 2025/12/01 05:33:11 jian fu wrote:
> > >>>> > >> >> > Hi Haiying Cai,
> > >>>> > >> >> >
> > >>>> > >> >> > Thanks for joining the discussion for this KIP. All of your
> > >>>> > >> >> > concerns are valid, and that is exactly why I introduced a
> > >>>> > >> >> > topic-level configuration to make this feature optional. This
> > >>>> > >> >> > means that, by default, the behavior remains unchanged. Only when
> > >>>> > >> >> > users are not pursuing faster broker boot time or other
> > >>>> > >> >> > optimizations, and care more about cost, would they enable this
> > >>>> > >> >> > option on some topics to save resources.
> > >>>> > >> >> >
> > >>>> > >> >> > Regarding the cost itself: the actual savings depend on the ratio
> > >>>> > >> >> > between local retention and remote retention. In the KIP/PR, I
> > >>>> > >> >> > provided a test example: if we configure 1 day of local retention
> > >>>> > >> >> > and 2 days of remote retention, we can save about 50%. And
> > >>>> > >> >> > realistically, I don't think anyone would boldly set local
> > >>>> > >> >> > retention to a very small value (such as minutes) due to the
> > >>>> > >> >> > latency concerns associated with remote storage. So in short, the
> > >>>> > >> >> > feature will help reduce cost, and the amount saved simply depends
> > >>>> > >> >> > on the ratio.
> > >>>> > >> >> > Take my company's usage as a real example: for most topics we
> > >>>> > >> >> > configure 1 day of local retention and 3–7 days of remote storage
> > >>>> > >> >> > (3 days for topics with log/metric usage, 7 days for topics with
> > >>>> > >> >> > normal business usage), and we don't care about boot speed or
> > >>>> > >> >> > similar benefits. This KIP allows us to save 1/7 to 1/3 of the
> > >>>> > >> >> > total disk usage for remote storage.
> > >>>> > >> >> >
> > >>>> > >> >> > Anyway, this is just a topic-level optional feature, and it does
> > >>>> > >> >> > not take away the benefits of the current design. Thanks again for
> > >>>> > >> >> > the discussion. I can update the KIP to better classify scenarios
> > >>>> > >> >> > where this optional feature is not suitable. Currently, I only
> > >>>> > >> >> > listed real-time analytics as the negative example.
> > >>>> > >> >> >
> > >>>> > >> >> > Further discussion is welcome to help make this KIP more complete.
> > >>>> > >> >> > Thanks!
> > >>>> > >> >> >
> > >>>> > >> >> > Regards,
> > >>>> > >> >> > Jian
> > >>>> > >> >> >
> > >>>> > >> >> > On Mon, Dec 1, 2025 at 12:40 PM Haiying Cai via dev <[email protected]> wrote:
> > >>>> > >> >> >
> > >>>> > >> >> > > Jian,
> > >>>> > >> >> > >
> > >>>> > >> >> > > Thanks for the contribution.  But I feel that uploading the
> > >>>> > >> >> > > local segment file to remote storage ASAP is advantageous in
> > >>>> > >> >> > > several scenarios:
> > >>>> > >> >> > >
> > >>>> > >> >> > > 1. It enables fast bootstrapping of a new broker. A new broker
> > >>>> > >> >> > > doesn’t have to replicate all the data from the leader broker; it
> > >>>> > >> >> > > only needs to replicate the data from the tail of the remote log
> > >>>> > >> >> > > segment to the tail of the current end of the topic (LSO), since
> > >>>> > >> >> > > all the other data is in the remote tiered storage and it can
> > >>>> > >> >> > > download it later lazily. This is what KIP-1023 is trying to
> > >>>> > >> >> > > solve;
> > >>>> > >> >> > > 2. Although nobody has proposed a KIP to allow a consumer client
> > >>>> > >> >> > > to read from the remote tiered storage directly, this would help
> > >>>> > >> >> > > a fall-behind consumer do catch-up reads or perform a backfill.
> > >>>> > >> >> > > This path allows the consumer backfill to finish without
> > >>>> > >> >> > > polluting the broker’s page cache.  The earlier the data is on
> > >>>> > >> >> > > the remote tiered storage, the more advantageous it is for the
> > >>>> > >> >> > > client.
> > >>>> > >> >> > >
> > >>>> > >> >> > > I think in your proposal you are delaying uploading the segment,
> > >>>> > >> >> > > but the file will still be uploaded at a later time. I guess this
> > >>>> > >> >> > > can save a few hours of storage cost for that file in the remote
> > >>>> > >> >> > > storage; I am not sure whether that is a significant saving (if
> > >>>> > >> >> > > the file needs to stay in remote tiered storage for several days
> > >>>> > >> >> > > or weeks due to retention policy).
> > >>>> > >> >> > >
> > >>>> > >> >> > > On 2025/11/19 13:29:11 jian fu wrote:
> > >>>> > >> >> > > > Hi everyone, I'd like to start a discussion on KIP-1241. The
> > >>>> > >> >> > > > goal is to reduce redundant remote storage usage. KIP:
> > >>>> > >> >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
> > >>>> > >> >> > > >
> > >>>> > >> >> > > > The Draft PR: https://github.com/apache/kafka/pull/20913
> > >>>> > >> >> > > >
> > >>>> > >> >> > > > Problem: Currently, Kafka's tiered storage implementation
> > >>>> > >> >> > > > uploads all non-active local log segments to remote storage
> > >>>> > >> >> > > > immediately, even when they are still within the local
> > >>>> > >> >> > > > retention period.
> > >>>> > >> >> > > > This results in redundant storage of the same data in both
> > >>>> > >> >> > > > local and remote tiers.
> > >>>> > >> >> > > >
> > >>>> > >> >> > > > When there is no requirement for real-time analytics or
> > >>>> > >> >> > > > immediate consumption based on remote storage, this has the
> > >>>> > >> >> > > > following drawbacks:
> > >>>> > >> >> > > >
> > >>>> > >> >> > > > 1. Wastes storage capacity and costs: The same data is stored
> > >>>> > >> >> > > > twice during the local retention window
> > >>>> > >> >> > > > 2. Provides no immediate benefit: During the local retention
> > >>>> > >> >> > > > period, reads prioritize local data, making the remote copy
> > >>>> > >> >> > > > unnecessary
> > >>>> > >> >> > > >
> > >>>> > >> >> > > >
> > >>>> > >> >> > > > So, this KIP is to reduce tiered storage redundancy with
> > >>>> > >> >> > > > delayed upload.
> > >>>> > >> >> > > > You can check the test result example here directly:
> > >>>> > >> >> > > > https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
> > >>>> > >> >> > > > Looking forward to your feedback! Best regards, Jian
> > >>>> > >> >> > > >
> > >>>> > >> >> >
> > >>>> > >> >
> > >>>> > >> >
> > >>>> > >>
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> > > --
> > > Regards
> > >
> > > Fu.Jian
> > >
> > >
> > >
> >
>
