Re: Re: [DISCUSS] KIP-1241: Reduce tiered storage redundancy with delayed upload

jian fu Mon, 15 Dec 2025 04:01:24 -0800

Hi All:

Bumping this thread for more discussion. I’d really appreciate more
suggestions on this optional feature for tiered storage. Thanks a lot!
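
For anyone skimming the thread, here is a rough sketch of the arithmetic discussed in the quoted messages below. It assumes a steady ingest rate, so stored bytes are proportional to retained hours, and it assumes a segment can be deleted locally only once it is both uploaded and older than local retention. The function names are made up for illustration and are not part of the KIP or the PR.

```python
from fractions import Fraction


def remote_savings_fraction(local_retention_hours: int,
                            total_retention_hours: int) -> Fraction:
    """Share of remote storage avoided by delaying upload.

    Assumption (illustrative model): eager upload keeps about `total`
    hours of data remotely, delayed upload keeps about `total - local`.
    """
    if not 0 < local_retention_hours <= total_retention_hours:
        raise ValueError("expected 0 < local retention <= total retention")
    return Fraction(local_retention_hours, total_retention_hours)


def local_hours_held_during_outage(local_retention_hours: int,
                                   outage_hours: int,
                                   delayed_upload: bool) -> int:
    """Hours of data pinned on local disk after a remote outage of given length.

    Assumption (illustrative model): segments written during the outage
    cannot be uploaded, so they cannot be deleted locally.
    """
    if local_retention_hours < 0 or outage_hours < 0:
        raise ValueError("hours must be non-negative")
    if delayed_upload:
        # Nothing younger than local retention was uploaded before the outage.
        return local_retention_hours + outage_hours
    # Eager upload: everything written before the outage is already remote.
    return max(local_retention_hours, outage_hours)


if __name__ == "__main__":
    # Retention pairs mentioned in the thread: (local hours, total hours).
    for local_h, total_h in [(24, 48), (24, 72), (24, 168), (3, 72)]:
        print(local_h, total_h, remote_savings_fraction(local_h, total_h))

    # Extra local disk (in hours of data) that delayed upload needs in an outage.
    for outage_h in (6, 24, 72):
        extra = (local_hours_held_during_outage(24, outage_h, True)
                 - local_hours_held_during_outage(24, outage_h, False))
        print(outage_h, extra)
```

Under this model, the extra local disk needed during an outage is min(local retention, outage length), so with 1 day of local retention it is capped at 1 day of data, which is the same amount the feature saves in remote storage.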


Regards

Jian

jian fu <[email protected]> wrote on Thu, Dec 4, 2025 at 21:54:

> Hi All:
>
> I updated the KIP according to the discussion with Kamal and Haiying:
> 1. Explicitly emphasized that this is a topic-level optional feature
> intended for users who prioritize cost.
> 2. Added a cost-saving calculation example.
> 3. Added details about the operational drawback of this feature: extra
> disk capacity is needed if remote storage suffers a long outage.
> 4. Added the scenarios where enabling the feature is not very suitable or
> beneficial, such as topics whose remote:local retention ratio is very
> large.
>
> Thanks again for joining the discussion.
>
> Regards
> Jian
>
> jian fu <[email protected]> wrote on Tue, Dec 2, 2025 at 20:27:
>
>> Hi Kamal:
>>
>> I think I understand what you mean now. I’ve updated the picture in the
>> link (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>> Could you help double-check whether we’ve reached the same understanding?
>> In short, the drawback of this KIP is that during a long remote storage
>> outage it will occupy more local disk. The maximum extra usage equals the
>> redundant part we are saving.
>> After the outage is resolved, disk usage returns to its original level.
>> Please correct me if my understanding is wrong. Thanks again.
>>
>> Regards
>> Jian
>>
>> Kamal Chandraprakash <[email protected]> wrote on Tue, Dec 2, 2025 at 19:29:
>>
>>> The already uploaded segments are eligible for deletion from the broker.
>>> So, when remote storage is down, then those segments can be deleted as
>>> per the local retention settings and new segments can occupy those
>>> spaces.
>>> This provides more time for the Admin to act when remote storage is down
>>> for a longer time.
>>>
>>> This is from a reliability perspective.
>>>
>>> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
>>>
>>> > Hi Kamal and Haiying Cai:
>>> >
>>> > As you may have noticed, my Kafka clusters are set to 1 day local + 3 to
>>> > 7 days remote, whereas Haiying Cai's configuration is 3 hours local +
>>> > 3 days remote.
>>> >
>>> > Let me explain more about my configuration.
>>> > I try to avoid the latency that delayed consumers incur when they read
>>> > from remote storage. Some applications may hit unexpected issues, and we
>>> > need to give them enough time to recover. During that period, we don't
>>> > want consumers reading from remote storage and hurting the whole Kafka
>>> > cluster. So one day is our expectation.
>>> >
>>> > I saw this statement in Haiying Cai's KIP-1248:
>>> > "Currently, when a new consumer or a fallen-off consumer requires
>>> > fetching messages from a while ago, and those messages are no longer
>>> > present in the Kafka broker's local storage, the broker must download
>>> > the message from the remote tiered storage and subsequently transfer the
>>> > data back to the consumer."
>>> > Extending the local retention time is how we try to avoid this issue.
>>> > (Here we don't consider the case of a new consumer using the earliest
>>> > strategy to consume; it does not happen often in our cases.)
>>> >
>>> > So, with my configuration, one day's worth of duplicated segments is
>>> > wasted in remote storage. I don't use them for real-time analytics, nor
>>> > do I care about fast broker boot or anything else. So I propose this KIP
>>> > as a topic-level optional feature to help us reduce waste and save
>>> > money.
>>> >
>>> > Regards
>>> > Jian
>>> >
>>> > jian fu <[email protected]> wrote on Tue, Dec 2, 2025 at 18:42:
>>> >
>>> > > Hi Kamal:
>>> > >
>>> > > Thanks for joining this discussion. Let me try to clarify my
>>> > > understanding of your good questions:
>>> > >
>>> > > 1. Kamal: Do you also have to update the RemoteCopy lag segments and
>>> > > bytes metric?
>>> > >    Jian: The code just delays the upload time for local segments, so it
>>> > > seems there is no need to change the lag segments or bytes metrics,
>>> > > right?
>>> > >
>>> > > 2. Kamal: As Haiying mentioned, the segments get eventually uploaded to
>>> > > remote so not sure about the benefit of this proposal. And, remote
>>> > > storage cost is considered as low when compared to broker local-disk.
>>> > >    Jian: The cost benefit is about the total size occupied. Take AWS
>>> > > S3 as an example: the tiered price for 1 GB is 0.02 USD (you can refer
>>> > > to https://calculator.aws/#/createCalculator/S3).
>>> > >    It is cheaper than local disk. As I mentioned, the money saved
>>> > > depends on the ratio of local to remote retention time. If you set a
>>> > > long remote retention time, the benefit is small; it merely avoids
>>> > > waste rather than saving real cost.
>>> > >    So I made it a topic-level optional configuration instead of a
>>> > > default feature.
>>> > >
>>> > > 3. Kamal: It provides some cushion during third-party object storage
>>> > > downtime.
>>> > >    Jian: I drew a picture to try to understand the logic
>>> > > (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>>> > > Could you help check whether my understanding is right? It seemed to me
>>> > > that there is no difference between them, so maybe we need to discuss
>>> > > this question more. The only difference may be a small temporary
>>> > > increase in local disk usage due to the delayed upload. That is why, in
>>> > > the original proposal, I wanted to upload N-1 segments, but the value
>>> > > of that seems small.
>>> > >
>>> > > BTW, I want to clarify one basic rule: this feature does not change the
>>> > > default behavior, and the amount saved is not large in all cases.
>>> > > It is suitable for topics that set a low remote/local ratio, such as
>>> > > 7 days/1 day or 3 days/1 day.
>>> > > Finally, thanks again for your time and comments. All the questions are
>>> > > valid and help us think more about it.
>>> > >
>>> > > Regards
>>> > > Jian
>>> > >
>>> > >
>>> > > Kamal Chandraprakash <[email protected]> wrote on Tue, Dec 2, 2025 at 17:41:
>>> > >
>>> > >> 1. Do you also have to update the RemoteCopy lag segments and bytes
>>> > >> metric?
>>> > >> 2. As Haiying mentioned, the segments get eventually uploaded to remote
>>> > >> so not sure about the benefit of this proposal. And, remote storage
>>> > >> cost is considered as low when compared to broker local-disk.
>>> > >> It provides some cushion during third-party object storage downtime.
>>> > >>
>>> > >>
>>> > >> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash <
>>> > >> [email protected]> wrote:
>>> > >>
>>> > >> > Hi Jian,
>>> > >> >
>>> > >> > Thanks for the KIP!
>>> > >> >
>>> > >> > When remote storage is unavailable for a few hrs, then with lazy
>>> > >> > upload there is a risk of the broker disk getting full soon.
>>> > >> > The Admin has to configure the local retention configs properly.
>>> > >> > With eager upload, the disk utilization won't grow until the local
>>> > >> > retention time (expectation is that all the passive segments are
>>> > >> > uploaded). This also provides some time for the Admin to take any
>>> > >> > action based on the situation.
>>> > >> >
>>> > >> > --
>>> > >> > Kamal
>>> > >> >
>>> > >> > On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev <
>>> > >> [email protected]>
>>> > >> > wrote:
>>> > >> >
>>> > >> >> Jian,
>>> > >> >>
>>> > >> >> I understand this is an optional feature and the cost saving depends
>>> > >> >> on the ratio between local.retention.ms and total retention.ms.
>>> > >> >>
>>> > >> >> In our setup, we have local.retention set to 3 hours and total
>>> > >> >> retention set to 3 days, so the saving is not going to be
>>> > >> >> significant.
>>> > >> >>
>>> > >> >> On 2025/12/01 05:33:11 jian fu wrote:
>>> > >> >> > Hi Haiying Cai,
>>> > >> >> >
>>> > >> >> > Thanks for joining the discussion on this KIP. All of your concerns
>>> > >> >> > are valid, and that is exactly why I introduced a topic-level
>>> > >> >> > configuration to make this feature optional. This means that, by
>>> > >> >> > default, the behavior remains unchanged. Only users who are not
>>> > >> >> > pursuing faster broker boot time or other optimizations, and who
>>> > >> >> > care more about cost, would enable this option on some topics to
>>> > >> >> > save resources.
>>> > >> >> >
>>> > >> >> > Regarding the cost itself: the actual savings depend on the ratio
>>> > >> >> > between local retention and remote retention. In the KIP/PR, I
>>> > >> >> > provided a test example: if we configure 1 day of local retention
>>> > >> >> > and 2 days of remote retention, we can save about 50%. And
>>> > >> >> > realistically, I don't think anyone would boldly set local
>>> > >> >> > retention to a very small value (such as minutes) due to the
>>> > >> >> > latency concerns associated with remote storage. So in short, the
>>> > >> >> > feature will help reduce cost, and the amount saved simply depends
>>> > >> >> > on the ratio.
>>> > >> >> > Take my company's usage as a real example: we configure most topics
>>> > >> >> > with 1 day of local retention and 3–7 days of remote storage
>>> > >> >> > (3 days for topics with log/metric usage, 7 days for topics with
>>> > >> >> > normal business usage), and we don't care about boot speed or
>>> > >> >> > anything else. This KIP allows us to save 1/7 to 1/3 of the total
>>> > >> >> > disk usage for remote storage.
>>> > >> >> >
>>> > >> >> > Anyway, this is just a topic-level optional feature, which doesn't
>>> > >> >> > negate the benefits of the current design. Thanks again for the
>>> > >> >> > discussion. I can update the KIP to better clarify the scenarios
>>> > >> >> > where this optional feature is not suitable. Currently, I only
>>> > >> >> > listed real-time analytics as the negative example.
>>> > >> >> >
>>> > >> >> > Further discussion is welcome to help make this KIP more complete.
>>> > >> >> > Thanks!
>>> > >> >> >
>>> > >> >> > Regards,
>>> > >> >> > Jian
>>> > >> >> >
>>> > >> >> > Haiying Cai via dev <[email protected]> wrote on Mon, Dec 1, 2025 at 12:40:
>>> > >> >> >
>>> > >> >> > > Jian,
>>> > >> >> > >
>>> > >> >> > > Thanks for the contribution. But I feel that uploading the local
>>> > >> >> > > segment file to remote storage ASAP is advantageous in several
>>> > >> >> > > scenarios:
>>> > >> >> > >
>>> > >> >> > > 1. It enables fast bootstrapping of a new broker. A new broker
>>> > >> >> > > doesn’t have to replicate all the data from the leader broker; it
>>> > >> >> > > only needs to replicate the data from the tail of the remote log
>>> > >> >> > > segment to the tail of the current end of the topic (LSO), since
>>> > >> >> > > all the other data is in the remote tiered storage and it can
>>> > >> >> > > download it later lazily. This is what KIP-1023 is trying to
>>> > >> >> > > solve.
>>> > >> >> > > 2. Although nobody has proposed a KIP to allow a consumer client
>>> > >> >> > > to read from the remote tiered storage directly, this would help
>>> > >> >> > > a fall-behind consumer do catch-up reads or perform a backfill.
>>> > >> >> > > This path allows the consumer backfill to finish without
>>> > >> >> > > polluting the broker’s page cache. The earlier the data is on the
>>> > >> >> > > remote tiered storage, the more advantageous it is for the
>>> > >> >> > > client.
>>> > >> >> > >
>>> > >> >> > > I think in your proposal you are delaying uploading the segment,
>>> > >> >> > > but the file will still be uploaded at a later time. I guess this
>>> > >> >> > > can save a few hours of storage cost for that file in the remote
>>> > >> >> > > storage; I am not sure whether that is a significant cost saving
>>> > >> >> > > (if the file needs to stay in remote tiered storage for several
>>> > >> >> > > days or weeks due to retention policy).
>>> > >> >> > >
>>> > >> >> > > On 2025/11/19 13:29:11 jian fu wrote:
>>> > >> >> > > > Hi everyone, I'd like to start a discussion on KIP-1241. The goal
>>> > >> >> > > > is to reduce remote storage usage.
>>> > >> >> > > >
>>> > >> >> > > > KIP:
>>> > >> >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
>>> > >> >> > > >
>>> > >> >> > > > Draft PR: https://github.com/apache/kafka/pull/20913
>>> > >> >> > > >
>>> > >> >> > > > Problem:
>>> > >> >> > > > Currently, Kafka's tiered storage implementation uploads all
>>> > >> >> > > > non-active local log segments to remote storage immediately, even
>>> > >> >> > > > when they are still within the local retention period.
>>> > >> >> > > > This results in redundant storage of the same data in both local
>>> > >> >> > > > and remote tiers.
>>> > >> >> > > >
>>> > >> >> > > > When there is no requirement for real-time analytics or immediate
>>> > >> >> > > > consumption based on remote storage, this has the following
>>> > >> >> > > > drawbacks:
>>> > >> >> > > >
>>> > >> >> > > > 1. Wastes storage capacity and costs: The same data is stored
>>> > >> >> > > > twice during the local retention window.
>>> > >> >> > > > 2. Provides no immediate benefit: During the local retention
>>> > >> >> > > > period, reads prioritize local data, making the remote copy
>>> > >> >> > > > unnecessary.
>>> > >> >> > > >
>>> > >> >> > > > So this KIP is to reduce tiered storage redundancy with delayed
>>> > >> >> > > > upload.
>>> > >> >> > > > You can check the test result example here directly:
>>> > >> >> > > > https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
>>> > >> >> > > >
>>> > >> >> > > > Looking forward to your feedback!
>>> > >> >> > > >
>>> > >> >> > > > Best regards,
>>> > >> >> > > > Jian
>>> > >> >> > > >
>>> > >> >> >
>>> > >> >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> > >
>>> > >
>>> >
>>>
>>
>>
>>
>>
>>
>
>
>

-- 
Regards

Fu.Jian
