Re: Re: [DISCUSS] KIP-1241: Reduce tiered storage redundancy with delayed upload

jian fu Sun, 04 Jan 2026 05:19:31 -0800

Hi All:

Happy New Year! Bumping this thread again for any further discussion
before the vote starts.
Thanks a lot!


Regards
Jian

On Mon, Dec 15, 2025 at 8:00 PM jian fu <[email protected]> wrote:

> Hi All:
>
> Bumping this thread for more discussion. I’d really appreciate more
> suggestions on this optional feature for tiered storage. Thanks a lot!
>
> Regards
>
> Jian
>
> On Thu, Dec 4, 2025 at 9:54 PM jian fu <[email protected]> wrote:
>
>> Hi All:
>>
>> I updated the KIP content based on the discussion with Kamal and Haiying:
>> 1. Explicitly emphasized that this is a topic-level optional feature
>> intended for users who prioritize cost.
>> 2. Added a cost-saving calculation example.
>> 3. Added more details about the operational drawback of this feature:
>> extra disk capacity is needed during a long remote storage outage.
>> 4. Added the scenarios where enabling the feature is not very suitable or
>> beneficial, such as topics with a very large remote:local retention ratio.
>>
>> Thanks again for joining the discussion.
>>
>> Regards
>> Jian
>>
>> On Tue, Dec 2, 2025 at 8:27 PM jian fu <[email protected]> wrote:
>>
>>> Hi Kamal:
>>>
>>> I think I understand what you mean now. I’ve updated the picture in the
>>> link (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>>> Could you help double-check whether we’ve reached the same understanding?
>>> In short, the drawback of this KIP is that, during a long remote
>>> storage outage, it will occupy more disk. The maximum extra usage is the
>>> redundant part we are saving.
>>> After the outage is resolved, disk usage returns to where it started.
>>> Please correct me if my understanding is wrong! Thanks again.
>>>
>>> Regards
>>> Jian
>>>
>>> On Tue, Dec 2, 2025 at 7:29 PM Kamal Chandraprakash <[email protected]> wrote:
>>>
>>>> The already uploaded segments are eligible for deletion from the broker.
>>>> So, when remote storage is down,
>>>> then those segments can be deleted as per the local retention settings
>>>> and
>>>> new segments can occupy those spaces.
>>>> This provides more time for the Admin to act when remote storage is down
>>>> for a longer time.
>>>>
>>>> This is from a reliability perspective.
>>>>
>>>> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
>>>>
>>>> > Hi Kamal and Haiying Cai:
>>>> >
>>>> > You may have noticed that my Kafka clusters use 1 day local + 3 to 7
>>>> > days remote retention, whereas Haiying Cai's configuration is 3 hours
>>>> > local + 3 days remote.
>>>> >
>>>> > Let me explain my configuration in more detail.
>>>> > I try to avoid the latency that delayed consumers would incur when
>>>> > reading from remote storage. Some applications may encounter unexpected
>>>> > issues, and we need to give them enough time to handle those. During
>>>> > that period, we don't want consumers to read from remote storage and
>>>> > hurt the whole Kafka cluster. So one day of local retention is what we
>>>> > expect.
>>>> >
>>>> > I saw this statement in Haiying Cai's KIP-1248:
>>>> > " Currently, when a new consumer or a fallen-off consumer requires
>>>> fetching
>>>> > messages from a while ago, and those messages are no longer present
>>>> in the
>>>> > Kafka broker's local storage, the broker must download the message
>>>> from the
>>>> > remote tiered storage and subsequently transfer the data back to the
>>>> > consumer.   "
>>>> > Extending the local retention time is how we try to avoid this issue.
>>>> > (Here we don't consider the case where a new consumer uses the earliest
>>>> > strategy to consume; that does not happen often in our case.)
>>>> >
>>>> > So, based on my configuration, one day's worth of segments is
>>>> > duplicated in remote storage and wasted. I don't use them for real-time
>>>> > analytics, and I don't care about fast reboot or anything else. So I
>>>> > propose this KIP as a topic-level optional feature to help us reduce
>>>> > waste and save money.
>>>> >
>>>> > Regards
>>>> > Jian
>>>> >
>>>> > On Tue, Dec 2, 2025 at 6:42 PM jian fu <[email protected]> wrote:
>>>> >
>>>> > > Hi  Kamal:
>>>> > >
>>>> > > Thanks for joining this discussion. Let me try to clarify my
>>>> > > understanding of your good questions:
>>>> > >
>>>> > > 1  Kamal: Do you also have to update the RemoteCopy lag segments and
>>>> > > bytes metric?
>>>> > >     Jian: The code only delays the upload time for local segments, so
>>>> > > it seems there is no need to change the lag segments or bytes metrics.
>>>> > > Right?
>>>> > >
>>>> > > 2  Kamal: As Haiying mentioned, the segments get eventually uploaded
>>>> > > to remote so not sure about the
>>>> > > benefit of this proposal. And, remote storage cost is considered as low
>>>> > > when compared to broker local-disk.
>>>> > >     Jian: The cost benefit comes from the total size occupied. Take AWS
>>>> > > S3 as an example: the tiered price for 1 GB is 0.02 USD (you can refer
>>>> > > to https://calculator.aws/#/createCalculator/S3).
>>>> > >   It is cheaper than local disk. So, as I mentioned, the money saved
>>>> > > depends on the ratio of local to remote retention time. If you set the
>>>> > > remote retention time to a long period, the benefit is small; it just
>>>> > > avoids waste rather than saving real cost.
>>>> > >   So I treat it as a topic-level optional configuration instead of a
>>>> > > default feature.
>>>> > >
>>>> > > 3  Kamal: It provides some cushion during third-party object storage
>>>> > > downtime.
>>>> > >     Jian: I drew a picture to try to understand the logic (
>>>> > > https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>>>> > > Could you check whether my understanding is right? It seems to me that
>>>> > > there is no difference between the two. So we may need to discuss this
>>>> > > question further. The only difference may be that local disk usage
>>>> > > increases a little temporarily because of the delayed upload to
>>>> > > remote. That is why, in the original proposal, I wanted to upload N-1
>>>> > > segments, but it seems the value of that is small.
>>>> > >
>>>> > > BTW, I want to clarify one basic rule: this feature does not change
>>>> > > the default behavior, and the saving is not very large in every case.
>>>> > > It is suitable for the subset of topics that set a low remote/local
>>>> > > retention ratio, such as 7 days/1 day or 3 days/1 day.
>>>> > > Finally, thanks again for your time and your comments. All the
>>>> > > questions are valid and help us think more about it.
>>>> > >
>>>> > > Regards
>>>> > > Jian
>>>> > >
>>>> > >
>>>> > > On Tue, Dec 2, 2025 at 5:41 PM Kamal Chandraprakash <[email protected]> wrote:
>>>> > >
>>>> > >> 1. Do you also have to update the RemoteCopy lag segments and bytes
>>>> > >> metric?
>>>> > >> 2. As Haiying mentioned, the segments get eventually uploaded to
>>>> remote
>>>> > so
>>>> > >> not sure about the
>>>> > >> benefit of this proposal. And, remote storage cost is considered
>>>> as low
>>>> > >> when compared to broker local-disk.
>>>> > >> It provides some cushion during third-party object storage
>>>> downtime.
>>>> > >>
>>>> > >>
>>>> > >> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash <
>>>> > >> [email protected]> wrote:
>>>> > >>
>>>> > >> > Hi Jian,
>>>> > >> >
>>>> > >> > Thanks for the KIP!
>>>> > >> >
>>>> > >> > When remote storage is unavailable for a few hrs, then with lazy
>>>> > upload
>>>> > >> > there is a risk of the broker disk getting full soon.
>>>> > >> > The Admin has to configure the local retention configs
>>>> properly.  With
>>>> > >> > eager upload, the disk utilization won't grow
>>>> > >> > until the local retention time (expectation is that all the
>>>> > >> > passive segments are uploaded). And, provides some time
>>>> > >> > for the Admin to take any action based on the situation.
>>>> > >> >
>>>> > >> > --
>>>> > >> > Kamal
>>>> > >> >
>>>> > >> > On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev <
>>>> > >> [email protected]>
>>>> > >> > wrote:
>>>> > >> >
>>>> > >> >> Jian,
>>>> > >> >>
>>>> > >> >> Understood that this is an optional feature and that the cost saving
>>>> > >> >> depends on the ratio between local.retention.ms and total retention.ms.
>>>> > >> >>
>>>> > >> >> In our setup, we have local.retention set to 3 hours and total
>>>> > >> retention
>>>> > >> >> set to 3 days, so the saving is not going to be significant.
>>>> > >> >>
>>>> > >> >> On 2025/12/01 05:33:11 jian fu wrote:
>>>> > >> >> > Hi Haiying Cai,
>>>> > >> >> >
>>>> > >> >> > Thanks for joining the discussion for this KIP. All of your
>>>> > >> >> > concerns are valid, and that is exactly why I introduced a
>>>> > >> >> > topic-level configuration to make this feature optional. This
>>>> > >> >> > means that, by default, the behavior remains unchanged. Only when
>>>> > >> >> > users are not pursuing faster broker boot time or other
>>>> > >> >> > optimizations, and care more about cost, would they enable this
>>>> > >> >> > option on some topics to save resources.
>>>> > >> >> >
>>>> > >> >> > Regarding cost itself: the actual savings depend on the ratio
>>>> > >> >> > between local retention and remote retention. In the KIP/PR, I
>>>> > >> >> > provided a test example: if we configure 1 day of local retention
>>>> > >> >> > and 2 days of remote retention, we can save about 50%. And
>>>> > >> >> > realistically, I don't think anyone would boldly set local
>>>> > >> >> > retention to a very small value (such as minutes) due to the
>>>> > >> >> > latency concerns associated with remote storage. So in short, the
>>>> > >> >> > feature will help reduce cost, and the amount saved simply depends
>>>> > >> >> > on the ratio.
>>>> > >> >> > Take my company's usage as a real example: we configure most
>>>> > >> >> > topics with 1 day of local retention and 3–7 days of remote
>>>> > >> >> > storage (3 days for topics with log/metric usage, 7 days for
>>>> > >> >> > topics with normal business usage), and we don't care about boot
>>>> > >> >> > speed or anything else. This KIP allows us to save 1/7 to 1/3 of
>>>> > >> >> > the total disk usage for remote storage.
>>>> > >> >> >
>>>> > >> >> > Anyway, this is just a topic-level optional feature that does not
>>>> > >> >> > negate the benefits of the current design. Thanks again for the
>>>> > >> >> > discussion. I can update the KIP to better classify scenarios
>>>> > >> >> > where this optional feature is not suitable. Currently, I only
>>>> > >> >> > listed real-time analytics as the negative example.
>>>> > >> >> >
>>>> > >> >> > Further discussion is welcome to help make this KIP more
>>>> > >> >> > complete. Thanks!
>>>> > >> >> >
>>>> > >> >> > Regards,
>>>> > >> >> > Jian
>>>> > >> >> >
>>>> > >> >> > On Mon, Dec 1, 2025 at 12:40 PM Haiying Cai via dev <[email protected]> wrote:
>>>> > >> >> >
>>>> > >> >> > > Jian,
>>>> > >> >> > >
>>>> > >> >> > > Thanks for the contribution. But I feel that uploading the
>>>> > >> >> > > local segment file to remote storage ASAP is advantageous in
>>>> > >> >> > > several scenarios:
>>>> > >> >> > >
>>>> > >> >> > > 1. It enables fast bootstrapping of a new broker. A new broker
>>>> > >> >> > > doesn’t have to replicate all the data from the leader broker;
>>>> > >> >> > > it only needs to replicate the data from the tail of the remote
>>>> > >> >> > > log segment to the tail of the current end of the topic (LSO),
>>>> > >> >> > > since all the other data are in the remote tiered storage and
>>>> > >> >> > > it can download them later lazily. This is what KIP-1023 is
>>>> > >> >> > > trying to solve;
>>>> > >> >> > > 2. Although nobody has proposed a KIP to allow a consumer
>>>> > >> >> > > client to read from the remote tiered storage directly, this
>>>> > >> >> > > would help a fall-behind consumer do catch-up reads or perform
>>>> > >> >> > > a backfill. This path allows the consumer backfill to finish
>>>> > >> >> > > without polluting the broker’s page cache. The earlier the data
>>>> > >> >> > > is on the remote tiered storage, the more advantageous it is
>>>> > >> >> > > for the client.
>>>> > >> >> > >
>>>> > >> >> > > I think in your proposal you are delaying the upload of the
>>>> > >> >> > > segment, but the file will still be uploaded at a later time. I
>>>> > >> >> > > guess this can save a few hours of storage cost for that file
>>>> > >> >> > > in the remote storage, but I am not sure whether that is a
>>>> > >> >> > > significant saving (if the file needs to stay in remote
>>>> > >> >> > > tiered storage for several days or weeks due to retention
>>>> > >> >> > > policy).
>>>> > >> >> > >
>>>> > >> >> > > On 2025/11/19 13:29:11 jian fu wrote:
>>>> > >> >> > > > Hi everyone, I'd like to start a discussion on KIP-1241. The
>>>> > >> >> > > > goal is to reduce remote storage usage. KIP:
>>>> > >> >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
>>>> > >> >> > > >
>>>> > >> >> > > > The Draft PR: https://github.com/apache/kafka/pull/20913
>>>> > >> >> > > >
>>>> > >> >> > > > Problem:
>>>> > >> >> > > > Currently, Kafka's tiered storage implementation uploads all
>>>> > >> >> > > > non-active local log segments to remote storage immediately,
>>>> > >> >> > > > even when they are still within the local retention period.
>>>> > >> >> > > > This results in redundant storage of the same data in both
>>>> > >> >> > > > local and remote tiers.
>>>> > >> >> > > >
>>>> > >> >> > > > When there is no requirement for real-time analytics or
>>>> > >> >> > > > immediate consumption based on remote storage, this has the
>>>> > >> >> > > > following drawbacks:
>>>> > >> >> > > >
>>>> > >> >> > > > 1. Wastes storage capacity and costs: The same data is stored
>>>> > >> >> > > > twice during the local retention window
>>>> > >> >> > > > 2. Provides no immediate benefit: During the local retention
>>>> > >> >> > > > period, reads prioritize local data, making the remote copy
>>>> > >> >> > > > unnecessary
>>>> > >> >> > > >
>>>> > >> >> > > >
>>>> > >> >> > > > So this KIP aims to reduce tiered storage redundancy with
>>>> > >> >> > > > delayed upload.
>>>> > >> >> > > > You can check the test result example here directly:
>>>> > >> >> > > > https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
>>>> > >> >> > > > Looking forward to your feedback!
>>>> > >> >> > > > Best regards, Jian
>>>> > >> >> > > >
>>>> > >> >> >
>>>> > >> >
>>>> > >> >
>>>> > >>
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
> --
> Regards
>
> Fu.Jian
>
>
>
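
The savings figures quoted in the thread (about 50% for 1 day local / 2 days
remote, and 1/7 to 1/3 for 1 day local with 3 to 7 days remote) can be
reproduced with a rough sketch. It assumes a constant ingest rate, that eager
upload keeps a segment in remote storage for the whole remote retention, and
that delayed upload keeps it there only after local retention has elapsed.
The function name is hypothetical and is not part of the KIP or of Kafka.

```python
from fractions import Fraction


def remote_saving_fraction(local_retention_hours: int,
                           remote_retention_hours: int) -> Fraction:
    """Fraction of steady-state remote storage avoided by delayed upload.

    Assumed model (not taken from the KIP text): the redundant part is the
    data that sits in both tiers during the local retention window.
    """
    if local_retention_hours <= 0 or remote_retention_hours <= 0:
        raise ValueError("retention values must be positive")
    # Local retention cannot usefully exceed the remote retention.
    local = min(local_retention_hours, remote_retention_hours)
    return Fraction(local, remote_retention_hours)


if __name__ == "__main__":
    # (local hours, remote hours) pairs mentioned in the discussion.
    for local, remote in [(24, 48), (24, 72), (24, 168), (3, 72)]:
        saving = remote_saving_fraction(local, remote)
        print(f"local={local}h remote={remote}h saving={saving}")
```

Under this model the 3 hours local / 3 days remote setup saves only 1/24 of
the remote footprint, which is consistent with the remark that the saving is
not significant there.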

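
The outage trade-off discussed above (eager upload leaves a cushion, delayed
upload needs extra disk, and the extra usage is bounded by the redundant part
being saved) can be sketched with a simple model. It assumes a constant ingest
rate, ignores the active segment, assumes a local segment is deleted only
after it has been uploaded, and assumes delayed upload happens right at local
retention expiry. The function and parameter names are hypothetical.

```python
def local_disk_during_outage(ingest_gb_per_hour: float,
                             local_retention_hours: float,
                             outage_hours: float,
                             delayed_upload: bool) -> float:
    """Approximate local disk usage (GB) after a remote outage of given length."""
    if min(ingest_gb_per_hour, local_retention_hours, outage_hours) < 0:
        raise ValueError("inputs must be non-negative")
    if delayed_upload:
        # Nothing inside the local window was uploaded before the outage,
        # so no segment can be deleted while remote storage is down.
        retained_hours = local_retention_hours + outage_hours
    else:
        # Segments uploaded before the outage still expire on schedule;
        # only data written during the outage piles up.
        retained_hours = max(local_retention_hours, outage_hours)
    return ingest_gb_per_hour * retained_hours


if __name__ == "__main__":
    for outage in (6, 24, 48):
        eager = local_disk_during_outage(10, 24, outage, delayed_upload=False)
        lazy = local_disk_during_outage(10, 24, outage, delayed_upload=True)
        print(f"outage={outage}h eager={eager}GB delayed={lazy}GB extra={lazy - eager}GB")
```

In this model the extra disk needed with delayed upload grows with the outage
length and is capped at one local retention window of data.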