Hi All,

Bumping this thread for more discussion. I’d really appreciate more suggestions on this optional feature for tiered storage. Thanks a lot!
Regards
Jian

jian fu <[email protected]> wrote on Thu, Dec 4, 2025 at 21:54:

> Hi All:
>
> I updated the KIP content according to Kamal and Haiying's discussion:
> 1. Explicitly emphasized that this is a topic-level optional feature
>    intended for users who prioritize cost.
> 2. Added the cost-saving calculation example.
> 3. Added more details about the operational drawback of this feature:
>    extra disk expansion is needed during a long remote storage outage.
> 4. Added the scenarios where enabling the feature may not be suitable or
>    beneficial, such as topics whose remote:local retention ratio is very
>    large.
>
> Thanks again for joining the discussion.
>
> Regards
> Jian
>
> jian fu <[email protected]> wrote on Tue, Dec 2, 2025 at 20:27:
>
>> Hi Kamal:
>>
>> I think I understand what you mean now. I’ve updated the picture in the
>> link (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>> Could you help double-check whether we’ve reached the same understanding?
>> In short, the drawback of this KIP is that, during a long remote storage
>> outage, it will occupy more disk. The maximum extra usage is the redundant
>> part we are saving.
>> Thus, after the outage is resolved, disk usage goes back to where it started.
>> Please correct me if my understanding is wrong! Thanks again.
>>
>> Regards
>> Jian
>>
>> Kamal Chandraprakash <[email protected]> wrote on Tue, Dec 2, 2025 at 19:29:
>>
>>> The already uploaded segments are eligible for deletion from the broker.
>>> So, when remote storage is down, those segments can be deleted as per
>>> the local retention settings and new segments can occupy that space.
>>> This gives the Admin more time to act when remote storage is down
>>> for a longer time.
>>>
>>> This is from a reliability perspective.
>>>
>>> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
>>>
>>> > Hi Kamal and Haiying Cai:
>>> >
>>> > Maybe you noticed that my Kafka clusters are set to 1 day local + 3 to 7 days
>>> > remote, whereas Haiying Cai's configuration is 3 hours local + 3 days remote.
>>> >
>>> > I can explain more about my configuration.
>>> > I try to avoid the latency that delayed consumers incur when they access
>>> > the remote tier. Some applications may encounter unexpected issues, and we
>>> > need to give them enough time to handle those. During that period, we don't
>>> > want consumers to access the remote tier and hurt the whole Kafka cluster.
>>> > So one day is our expectation.
>>> >
>>> > I saw one statement in Haiying Cai's KIP-1248:
>>> > "Currently, when a new consumer or a fallen-off consumer requires fetching
>>> > messages from a while ago, and those messages are no longer present in the
>>> > Kafka broker's local storage, the broker must download the message from the
>>> > remote tiered storage and subsequently transfer the data back to the
>>> > consumer."
>>> > Extending the local retention time is how we try to avoid the issue. (Here, we
>>> > don't consider the case where a new consumer uses the earliest strategy to
>>> > consume; it does not happen often in our case.)
>>> >
>>> > So, based on my configuration, I see one day's worth of duplicated segments
>>> > wasted in remote storage, since I don't use them for real-time analytics and
>>> > don't care about fast reboot or anything else. So I propose this KIP to add
>>> > a topic-level optional feature to help us reduce waste and save money.
>>> >
>>> > Regards
>>> > Jian
>>> >
>>> > jian fu <[email protected]> wrote on Tue, Dec 2, 2025 at 18:42:
>>> >
>>> > > Hi Kamal:
>>> > >
>>> > > Thanks for joining this discussion. Let me try to clarify my understanding
>>> > > of your good questions:
>>> > >
>>> > > 1 Kamal: Do you also have to update the RemoteCopy lag segments and
>>> > > bytes metric?
>>> > > Jian: The code just delays the upload time for local segments, so it
>>> > > seems there is no need to change any lag segments or metrics. Right?
>>> > >
>>> > > 2 Kamal: As Haiying mentioned, the segments get eventually uploaded to
>>> > > remote so not sure about the
>>> > > benefit of this proposal. And, remote storage cost is considered as low
>>> > > when compared to broker local-disk.
>>> > > Jian: The cost benefit is about the total size occupied. Take AWS
>>> > > S3 as an example: the tiered price for 1 GB is 0.02 USD (you can refer to
>>> > > https://calculator.aws/#/createCalculator/S3).
>>> > > It is cheaper than local disk. So, as I mentioned, the money saved
>>> > > depends on the ratio of local to remote retention time. If you set the
>>> > > remote retention time to a long period, the benefit is small; it just
>>> > > avoids waste rather than saving much cost.
>>> > > So I made it a topic-level optional configuration instead of a default
>>> > > feature.
>>> > >
>>> > > 3 Kamal: It provides some cushion during third-party object storage
>>> > > downtime.
>>> > > Jian: I drew a picture to try to understand the logic
>>> > > (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230). You
>>> > > can help check whether my understanding is right. It seems to me there is
>>> > > no difference between them. So for this question, maybe we need to discuss
>>> > > it more. The only difference may be that we temporarily use a little more
>>> > > local disk due to the delayed upload to remote. So in the original
>>> > > proposal, I wanted to upload N-1 segments, but the value of that seems small.
>>> > >
>>> > > BTW, I want to clarify one basic rule: this feature doesn't change the
>>> > > default behavior, and the saving is not large in every case.
>>> > > It is suitable for the subset of topics that set a low remote/local ratio,
>>> > > such as 7 days/1 day or 3 days/1 day.
>>> > > Lastly, thanks again for your time and your comments. All the
>>> > > questions are valid and help us think more about it.
>>> > >
>>> > > Regards
>>> > > Jian
>>> > >
>>> > >
>>> > > Kamal Chandraprakash <[email protected]> wrote on Tue, Dec 2, 2025 at 17:41:
>>> > >
>>> > >> 1. Do you also have to update the RemoteCopy lag segments and bytes
>>> > >> metric?
>>> > >> 2. As Haiying mentioned, the segments get eventually uploaded to remote so
>>> > >> not sure about the
>>> > >> benefit of this proposal. And, remote storage cost is considered as low
>>> > >> when compared to broker local-disk.
>>> > >> It provides some cushion during third-party object storage downtime.
>>> > >>
>>> > >>
>>> > >> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash <[email protected]> wrote:
>>> > >>
>>> > >> > Hi Jian,
>>> > >> >
>>> > >> > Thanks for the KIP!
>>> > >> >
>>> > >> > When remote storage is unavailable for a few hrs, then with lazy upload
>>> > >> > there is a risk of the broker disk getting full soon.
>>> > >> > The Admin has to configure the local retention configs properly. With
>>> > >> > eager upload, the disk utilization won't grow
>>> > >> > until the local retention time (expectation is that all the
>>> > >> > passive segments are uploaded). And, provides some time
>>> > >> > for the Admin to take any action based on the situation.
>>> > >> >
>>> > >> > --
>>> > >> > Kamal
>>> > >> >
>>> > >> > On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev <[email protected]>
>>> > >> > wrote:
>>> > >> >
>>> > >> >> Jian,
>>> > >> >>
>>> > >> >> Understood that this is an optional feature and the cost saving depends on
>>> > >> >> the ratio between local.retention.ms and total retention.ms.
>>> > >> >>
>>> > >> >> In our setup, we have local.retention set to 3 hours and total retention
>>> > >> >> set to 3 days, so the saving is not going to be significant.
>>> > >> >>
>>> > >> >> On 2025/12/01 05:33:11 jian fu wrote:
>>> > >> >> > Hi Haiying Cai,
>>> > >> >> >
>>> > >> >> > Thanks for joining the discussion for this KIP. All of your concerns are
>>> > >> >> > valid, and that is exactly why I introduced a topic-level configuration to
>>> > >> >> > make this feature optional. This means that, by default, the behavior
>>> > >> >> > remains unchanged. Only when users are not pursuing faster broker boot time
>>> > >> >> > or other optimizations, and care more about cost, would they enable this
>>> > >> >> > option on some topics to save resources.
>>> > >> >> >
>>> > >> >> > Regarding the cost itself: the actual savings depend on the ratio between
>>> > >> >> > local retention and remote retention. In the KIP/PR, I provided a test
>>> > >> >> > example: if we configure 1 day of local retention and 2 days of remote
>>> > >> >> > retention, we can save about 50%. And realistically, I don't think anyone
>>> > >> >> > would boldly set local retention to a very small value (such as minutes)
>>> > >> >> > due to the latency concerns associated with remote storage. So in short,
>>> > >> >> > the feature will help reduce cost, and the amount saved simply depends on
>>> > >> >> > the ratio.
>>> > >> >> > Taking my company's usage as a real example, we configure most of the
>>> > >> >> > topics with 1 day of local retention and 3–7 days of remote storage (3 days
>>> > >> >> > for topics with log/metric usage, 7 days for topics with normal business
>>> > >> >> > usage).
>>> > >> >> > And we don't care about boot speed or anything else. This KIP allows us to
>>> > >> >> > save 1/7 to 1/3 of the total disk usage for remote storage.
>>> > >> >> >
>>> > >> >> > Anyway, this is just a topic-level optional feature which doesn't negate
>>> > >> >> > the benefit of the current design. Thanks again for the discussion. I can
>>> > >> >> > update the KIP to better clarify scenarios where this optional feature is
>>> > >> >> > not suitable. Currently, I only listed real-time analytics as the negative
>>> > >> >> > example.
>>> > >> >> >
>>> > >> >> > Further discussion is welcome to help make this KIP more complete. Thanks!
>>> > >> >> >
>>> > >> >> > Regards,
>>> > >> >> > Jian
>>> > >> >> >
>>> > >> >> > Haiying Cai via dev <[email protected]> wrote on Mon, Dec 1, 2025 at 12:40:
>>> > >> >> >
>>> > >> >> > > Jian,
>>> > >> >> > >
>>> > >> >> > > Thanks for the contribution. But I feel that uploading the local segment
>>> > >> >> > > file to remote storage ASAP is advantageous in several scenarios:
>>> > >> >> > >
>>> > >> >> > > 1. Enable fast bootstrapping of a new broker. A new broker doesn’t have
>>> > >> >> > > to replicate all the data from the leader broker; it only needs to
>>> > >> >> > > replicate the data from the tail of the remote log segment to the tail of
>>> > >> >> > > the current end of the topic (LSO), since all the other data are in the
>>> > >> >> > > remote tiered storage and it can download them later lazily. This is what
>>> > >> >> > > KIP-1023 is trying to solve;
>>> > >> >> > > 2.
>>> > >> >> > >    Although nobody has proposed a KIP to allow a consumer client to read
>>> > >> >> > > from the remote tiered storage directly, this would help a
>>> > >> >> > > fall-behind consumer do catch-up reads or perform a backfill. This
>>> > >> >> > > path allows the consumer backfill to finish without polluting the broker’s
>>> > >> >> > > page cache. The earlier the data is on the remote tiered storage, the more
>>> > >> >> > > advantageous it is for the client.
>>> > >> >> > >
>>> > >> >> > > I think in your proposal, you are delaying uploading the segment, but the
>>> > >> >> > > file will still be uploaded at a later time. I guess this can save a few
>>> > >> >> > > hours of storage cost for that file in the remote storage; I'm not sure
>>> > >> >> > > whether that is a significant cost saving (if the file needs to stay in
>>> > >> >> > > remote tiered storage for several days or weeks due to retention policy).
>>> > >> >> > >
>>> > >> >> > > On 2025/11/19 13:29:11 jian fu wrote:
>>> > >> >> > > > Hi everyone, I'd like to start a discussion on KIP-1241. The goal is to
>>> > >> >> > > > reduce remote storage usage. KIP:
>>> > >> >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
>>> > >> >> > > >
>>> > >> >> > > > The Draft PR: https://github.com/apache/kafka/pull/20913
>>> > >> >> > > >
>>> > >> >> > > > Problem:
>>> > >> >> > > > Currently, Kafka's tiered storage implementation uploads all non-active
>>> > >> >> > > > local log segments to remote storage immediately, even when they are
>>> > >> >> > > > still within the local retention period.
>>> > >> >> > > > This results in redundant storage of the same data in both local and
>>> > >> >> > > > remote tiers.
>>> > >> >> > > >
>>> > >> >> > > > When there is no requirement for real-time analytics or immediate
>>> > >> >> > > > consumption based on remote storage, this has the following drawbacks:
>>> > >> >> > > >
>>> > >> >> > > > 1. Wastes storage capacity and costs: The same data is stored twice
>>> > >> >> > > > during the local retention window
>>> > >> >> > > > 2. Provides no immediate benefit: During the local retention period,
>>> > >> >> > > > reads prioritize local data, making the remote copy unnecessary
>>> > >> >> > > >
>>> > >> >> > > >
>>> > >> >> > > > So this KIP aims to reduce tiered storage redundancy with delayed upload.
>>> > >> >> > > > You can check the test result example here directly:
>>> > >> >> > > > https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
>>> > >> >> > > > Looking forward to your feedback! Best regards, Jian
>>> > >> >> > > >

--
Regards
Fu.Jian
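
For reference, the saving arithmetic discussed in this thread (1 day local with 2 days of retention gives about 50%; 1 day local with 3 to 7 days gives 1/3 to 1/7; 3 hours local with 3 days is not significant) can be sketched roughly as below. This is only a sketch under stated assumptions: steady-state traffic, "remote retention" in the thread meaning the total retention (retention.ms), and delayed upload moving a segment to remote storage only when its local retention expires. The function name is hypothetical and is not part of Kafka or the KIP.

```python
from fractions import Fraction


def remote_saving_fraction(local_retention_hours: int, total_retention_hours: int) -> Fraction:
    """Fraction of remote storage avoided by delayed upload (hypothetical helper).

    Assumption: with eager upload the remote tier holds roughly the whole
    retention window; with delayed upload it holds only the part that is
    older than the local retention window.
    """
    if local_retention_hours < 0 or total_retention_hours <= 0:
        raise ValueError("local retention must be >= 0 and total retention > 0")
    # Local retention cannot usefully exceed the total retention.
    local = min(local_retention_hours, total_retention_hours)
    return Fraction(local, total_retention_hours)


if __name__ == "__main__":
    # Configurations mentioned in the thread, expressed in hours.
    examples = {
        "1 day local / 2 days total": (24, 48),
        "1 day local / 3 days total": (24, 72),
        "1 day local / 7 days total": (24, 168),
        "3 hours local / 3 days total": (3, 72),
    }
    for label, (local_h, total_h) in examples.items():
        saving = remote_saving_fraction(local_h, total_h)
        print(f"{label}: {saving} (~{float(saving):.1%})")
```

Under these assumptions the four configurations give 1/2, 1/3, 1/7 and 1/24, which is consistent with the point made above that the benefit shrinks as the remote:local ratio grows.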
