Hey Kamal,

Some additional points about Q4:

> The user can decide when to change their internal topic cleanup policy to
> compact. If someone retains
> the data in the remote storage for 3 months, then they can migrate to the
> compacted topic after 3 months
> post rolling out this change. And, update their cleanup policy to [compact,
> delete].


I don't think it's a good idea to keep delete in the final cleanup policy
for the topic `__remote_log_metadata`, as it still requires the user to
keep track of the maximum retention hours of the topics that have remote
storage enabled, which is an operational burden. It's also hard to reason
about what will happen if the user configures the wrong retention.ms. I
hope this makes sense.
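
To make the operational burden concrete, here is a rough, untested sketch of
the kind of check an operator would have to keep running if delete stays in
the policy (the class name is hypothetical; it only uses the standard
AdminClient API):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;

// Hypothetical operator-side check: with [compact, delete], retention.ms on
// __remote_log_metadata must stay above the max retention.ms of every
// remote-storage-enabled topic, or live metadata may be deleted too early.
public class RetentionCheck {
    static void check(Admin admin, List<String> remoteEnabledTopics) throws Exception {
        long maxUserRetentionMs = 0L;
        for (String topic : remoteEnabledTopics) {
            ConfigResource res = new ConfigResource(ConfigResource.Type.TOPIC, topic);
            Config cfg = admin.describeConfigs(List.of(res)).all().get().get(res);
            maxUserRetentionMs = Math.max(maxUserRetentionMs,
                    Long.parseLong(cfg.get(TopicConfig.RETENTION_MS_CONFIG).value()));
        }
        ConfigResource meta = new ConfigResource(ConfigResource.Type.TOPIC, "__remote_log_metadata");
        Config metaCfg = admin.describeConfigs(List.of(meta)).all().get().get(meta);
        long metaRetentionMs = Long.parseLong(metaCfg.get(TopicConfig.RETENTION_MS_CONFIG).value());
        if (metaRetentionMs != -1 && metaRetentionMs <= maxUserRetentionMs) {
            throw new IllegalStateException("retention.ms on __remote_log_metadata is too low");
        }
    }
}

Getting this value wrong is exactly the failure mode that is hard to reason
about, which is why I'd rather not keep delete in the final policy.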


Thanks,
Lijun Tong

Lijun Tong <[email protected]> 于2026年1月15日周四 11:43写道:

> Hey Kamal,
>
> Thanks for your reply! I am glad we are on the same page about making
> compaction of the __remote_log_metadata topic optional for the user. I
> will update the KIP with this change.
>
> For Q2:
> With the key designed as TopicId:Partition:EndOffset:BrokerLeaderEpoch,
> even if the same broker retries the upload multiple times for the same
> log segment, the latest retry attempt with the latest segment UUID will
> overwrite the previous attempts' value, since they share the same key. So
> we don't need to explicitly track the failed upload metadata; it is
> already superseded by the later attempt. That's my understanding of the
> RLMCopyTask; correct me if I am wrong.
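>
> To make the overwrite semantics concrete, here is a rough, untested sketch
> (the key layout follows this discussion, but the class and method names are
> hypothetical, and the real serialization is up to the implementation):
>
> import org.apache.kafka.clients.producer.Producer;
> import org.apache.kafka.clients.producer.ProducerRecord;
>
> // Hypothetical sketch: both upload attempts of the same logical segment
> // share one key, so compaction keeps only the latest attempt's value.
> public class SegmentKeySketch {
>     static String segmentKey(String topicId, int partition, long endOffset,
>                              int brokerLeaderEpoch) {
>         return topicId + ":" + partition + ":" + endOffset + ":" + brokerLeaderEpoch;
>     }
>
>     static void uploadWithRetry(Producer<String, byte[]> producer, String key,
>                                 byte[] firstAttempt, byte[] retryAttempt) {
>         // First attempt (e.g. the object-storage upload failed midway).
>         producer.send(new ProducerRecord<>("__remote_log_metadata_compacted", key, firstAttempt));
>         // The retry carries a new segment UUID in the value but the SAME key,
>         // so the compacted topic eventually retains only this latest record;
>         // the failed attempt's metadata needs no explicit cleanup.
>         producer.send(new ProducerRecord<>("__remote_log_metadata_compacted", key, retryAttempt));
>     }
> }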
>
> Thanks,
> Lijun Tong
>
> On Wed, Jan 14, 2026 at 21:18, Kamal Chandraprakash
> <[email protected]> wrote:
>
>> Hi Lijun,
>>
>> Thanks for the reply!
>>
>> Q1: Sounds good. Could you clarify in the KIP that the same partitioner
>> will be used?
>>
>> Q2: With the TopicId:Partition:EndOffset:BrokerLeaderEpoch key, if the
>> same broker retries the upload due to intermittent issues in object
>> storage (or) RLMM, then the failed upload metadata also needs to be
>> cleared.
>>
>> Q3: We may have to skip the null value records in the ConsumerTask.
>>
>> Q4a: The idea is to keep the cleanup policy as "delete" and also send
>> the tombstone markers to the existing `__remote_log_metadata` topic, and
>> handle the tombstone records in the ConsumerTask.
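>>
>> As a rough illustration of both halves (placeholder names, not the actual
>> RLMM code paths):
>>
>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>> import org.apache.kafka.clients.producer.Producer;
>> import org.apache.kafka.clients.producer.ProducerRecord;
>>
>> // Placeholder sketch: a null value is the tombstone marker.
>> public class TombstoneSketch {
>>     // Producer side: once the topic's cleanup policy includes compact,
>>     // compaction drops the key entirely after delete.retention.ms.
>>     static void markSegmentDeleted(Producer<String, byte[]> producer, String segmentKey) {
>>         producer.send(new ProducerRecord<>("__remote_log_metadata", segmentKey, null));
>>     }
>>
>>     // Consumer side: ConsumerTask would treat a tombstone as "forget this
>>     // segment" (or a no-op before migration) instead of deserializing it.
>>     static void handleRecord(ConsumerRecord<String, byte[]> record) {
>>         if (record.value() == null) {
>>             return; // tombstone: nothing to materialize into the cache
>>         }
>>         // ... existing deserialization and cache-update path ...
>>     }
>> }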
>>
>> The user can decide when to change their internal topic cleanup policy
>> to compact. If someone retains the data in remote storage for 3 months,
>> then they can migrate to the compacted topic 3 months after rolling out
>> this change, and update their cleanup policy to [compact, delete].
>>
>> Thanks,
>> Kamal
>>
>> On Thu, Jan 15, 2026 at 4:12 AM Lijun Tong <[email protected]>
>> wrote:
>>
>> > Hey Jian,
>> >
>> > Thanks for taking the time to review this KIP. I appreciate your
>> > proposing a simpler migration solution for onboarding the new feature.
>> >
>> > There are two points that I think can be further refined:
>> >
>> > 1) Make compaction of the topic optional. The new feature will continue
>> > to emit tombstone messages for expired log segments even while the
>> > topic is still in time-based retention mode, so once the user switches
>> > to the compacted topic, those expired messages can still be deleted
>> > even though the topic is no longer retention-based.
>> > 2) We need to expose a flag to the user to indicate whether the topic
>> > can be flipped to compacted, by checking whether all of the old-format
>> > keyless messages have expired, and allow the user to flip to compacted
>> > only when the flag is true (see the sketch below).
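>> >
>> > A rough sketch of the check behind that flag (a hypothetical helper; it
>> > assumes the old-format records are exactly the ones without keys):
>> >
>> > import org.apache.kafka.clients.consumer.Consumer;
>> > import org.apache.kafka.clients.consumer.ConsumerRecord;
>> > import org.apache.kafka.clients.consumer.ConsumerRecords;
>> > import java.time.Duration;
>> >
>> > public class FlipReadinessSketch {
>> >     // Hypothetical: safe to flip to compact only once no keyless record
>> >     // remains. Assumes the consumer is assigned to all partitions of
>> >     // __remote_log_metadata and seeked to the beginning; it stops at the
>> >     // first empty poll for brevity (a real check would compare positions
>> >     // against the partitions' end offsets).
>> >     static boolean safeToFlipToCompact(Consumer<byte[], byte[]> consumer) {
>> >         ConsumerRecords<byte[], byte[]> records;
>> >         while (!(records = consumer.poll(Duration.ofSeconds(1))).isEmpty()) {
>> >             for (ConsumerRecord<byte[], byte[]> record : records) {
>> >                 if (record.key() == null) {
>> >                     return false; // an unexpired old-format record remains
>> >                 }
>> >             }
>> >         }
>> >         return true;
>> >     }
>> > }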
>> >
>> > Thanks for sharing your idea. I will update the KIP with it later.
>> >
>> > Best,
>> > Lijun Tong
>> >
>> >
>> > On Mon, Jan 12, 2026 at 04:55, jian fu <[email protected]> wrote:
>> >
>> > > Hi Lijun Tong:
>> > >
>> > > Thanks for your KIP, which raises this critical issue.
>> > >
>> > > What about keeping just one topic instead of introducing another one?
>> > > For migrating the existing topic's data, maybe we can solve the issue
>> > > this way:
>> > > (1) Set the metadata topic's retention time greater than the retention
>> > > time of every topic that enables remote storage.
>> > > (2) Deploy the new Kafka version with the feature that sends the
>> > > messages with keys.
>> > > (3) Wait until all the old messages have expired and new keyed
>> > > messages are coming into the topic.
>> > > (4) Convert the topic to compact.
>> > >
>> > > I haven't tested this; I'm just proposing it based on code review, for
>> > > your reference. The steps may be a little complex, but they avoid
>> > > adding a new topic; see the sketch below.
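>> > >
>> > > For reference, steps (1) and (4) would look roughly like this with the
>> > > AdminClient (an untested sketch, like the proposal itself; the values
>> > > are examples only):
>> > >
>> > > import org.apache.kafka.clients.admin.Admin;
>> > > import org.apache.kafka.clients.admin.AlterConfigOp;
>> > > import org.apache.kafka.clients.admin.ConfigEntry;
>> > > import org.apache.kafka.common.config.ConfigResource;
>> > > import org.apache.kafka.common.config.TopicConfig;
>> > > import java.util.List;
>> > > import java.util.Map;
>> > >
>> > > public class MigrationSketch {
>> > >     static final ConfigResource META =
>> > >             new ConfigResource(ConfigResource.Type.TOPIC, "__remote_log_metadata");
>> > >
>> > >     // Step (1): set retention longer than any remote-storage-enabled topic's.
>> > >     static void raiseRetention(Admin admin, long retentionMs) throws Exception {
>> > >         AlterConfigOp op = new AlterConfigOp(
>> > >                 new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, Long.toString(retentionMs)),
>> > >                 AlterConfigOp.OpType.SET);
>> > >         admin.incrementalAlterConfigs(Map.of(META, List.of(op))).all().get();
>> > >     }
>> > >
>> > >     // Step (4): flip the cleanup policy once all keyless messages have expired.
>> > >     static void convertToCompact(Admin admin) throws Exception {
>> > >         AlterConfigOp op = new AlterConfigOp(
>> > >                 new ConfigEntry(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT),
>> > >                 AlterConfigOp.OpType.SET);
>> > >         admin.incrementalAlterConfigs(Map.of(META, List.of(op))).all().get();
>> > >     }
>> > > }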
>> > >
>> > > Regards
>> > > Jian
>> > >
>> > > On Thu, Jan 8, 2026 at 09:17, Lijun Tong <[email protected]> wrote:
>> > >
>> > > > Hey Kamal,
>> > > >
>> > > >
>> > > > Thanks for taking the time to review.
>> > > >
>> > > >
>> > > > Here is my response to your questions:
>> > > >
>> > > > Q1 At this point, I don't see a need to change
>> > > > RemoteLogMetadataTopicPartitioner for this design. Nothing in the
>> > > > current approach appears to require a partitioner change, but I'm open
>> > > > to revisiting if a concrete need arises.
>> > > >
>> > > > Q2 I have some reservations about using SegmentId:State as the key. A
>> > > > practical challenge we see today is that the same logical segment can
>> > > > be retried multiple times with different SegmentIds across brokers. If
>> > > > the key is SegmentId-based, it becomes harder to discover and tombstone
>> > > > all related attempts when the segment eventually expires. The
>> > > > TopicId:Partition:EndOffset:BrokerLeaderEpoch key is deterministic for
>> > > > a logical segment attempt and helps group retries by epoch, which
>> > > > simplifies cleanup and reasoning about state. I'd love to understand
>> > > > the benefits you're seeing with SegmentId:State compared to the
>> > > > offset/epoch-based key so we can weigh the trade-offs.
>> > > >
>> > > > On partitioning: with this proposal, all states for a given user
>> > > > topic-partition still map to the same metadata partition. That remains
>> > > > true for the existing __remote_log_metadata (unchanged partitioner) and
>> > > > for the new __remote_log_metadata_compacted, preserving the properties
>> > > > RemoteMetadataCache relies on.
>> > > >
>> > > > Q3 It should be fine for ConsumerTask to ignore tombstone records (null
>> > > > values) and no-op.
>> > > >
>> > > > Q4 Although TBRLMM is a sample RLMM implementation, it's currently the
>> > > > only OSS option and is widely used. The new
>> > > > __remote_log_metadata_compacted topic offers clear operational benefits
>> > > > in that context. We can also provide a configuration to let users
>> > > > choose whether they want to keep the audit topic
>> > > > (__remote_log_metadata) in their cluster.
>> > > >
>> > > > Q4a Enabling compaction on __remote_log_metadata alone may not fully
>> > > > address the unbounded growth, since we also need to emit tombstones for
>> > > > expired keys to delete them. Deferring compaction and tombstoning to
>> > > > user configuration could complicate the code flow, add operational
>> > > > complexity, and make outcomes less predictable. The proposal aims to
>> > > > provide a consistent experience by defining deterministic keys and
>> > > > emitting tombstones as part of the broker's responsibilities, while
>> > > > still allowing users to opt out of the audit topic if they prefer. But
>> > > > I am open to more discussion if there is any concrete need I don't
>> > > > foresee.
>> > > >
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Lijun Tong
>> > > >
>> > > > On Tue, Jan 6, 2026 at 01:01, Kamal Chandraprakash
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > Hi Lijun,
>> > > > >
>> > > > > Thanks for the KIP! I went over it in a first pass.
>> > > > >
>> > > > > A few questions:
>> > > > >
>> > > > > 1. Are we going to maintain the same RemoteLogMetadataTopicPartitioner
>> > > > > <https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/metadata/storage/RemoteLogMetadataTopicPartitioner.java>
>> > > > > for both topics? It is not clear in the KIP; could you clarify?
>> > > > > 2. Can the key be changed to SegmentId:State instead of
>> > > > > TopicId:Partition:EndOffset:BrokerLeaderEpoch if the same partitioner
>> > > > > is used? It is good to maintain all the segment states for a
>> > > > > user-topic-partition in the same metadata partition.
>> > > > > 3. Do we have to handle records with a null value (tombstone) in the
>> > > > > ConsumerTask
>> > > > > <https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/metadata/storage/ConsumerTask.java?L166>?
>> > > > > 4. TBRLMM
>> > > > > <https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/metadata/storage/TopicBasedRemoteLogMetadataManager.java>
>> > > > > is a sample plugin implementation of RLMM. I am not sure whether the
>> > > > > community will agree to add one more internal topic for this plugin
>> > > > > implementation.
>> > > > > 4a. Can we modify the new messages sent to the __remote_log_metadata
>> > > > > topic to contain the key, and leave it to the user to enable
>> > > > > compaction for this topic if they need it?
>> > > > >
>> > > > > Thanks,
>> > > > > Kamal
>> > > > >
>> > > > > On Tue, Jan 6, 2026 at 7:35 AM Lijun Tong <[email protected]> wrote:
>> > > > >
>> > > > > > Hey Henry,
>> > > > > >
>> > > > > > Thank you for your time and response! I really like your KIP-1248
>> > > > > > about offloading remote log consumption away from the broker, and I
>> > > > > > think with that change, topics that enable tiered storage can also
>> > > > > > have longer retention configurations and would benefit from this
>> > > > > > KIP too.
>> > > > > >
>> > > > > > > Some suggestions: In your example scenarios, it would also be good
>> > > > > > > to add an example of remote log segment deletion triggered by the
>> > > > > > > retention policy, which will generate a tombstone event in the
>> > > > > > > metadata topic and trigger log compaction/deletion 24 hours later;
>> > > > > > > I think this is the key event for capping the metadata topic size.
>> > > > > >
>> > > > > >
>> > > > > > Regarding this suggestion, I am not sure whether Scenario 4
>> > > > > > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618613#KIP1266:BoundingTheNumberOfRemoteLogMetadataMessagesviaCompactedTopic-Scenario4:SegmentDeletion>
>> > > > > > has covered it. I can add more rows to the Timeline Table, like
>> > > > > > T5+24hour, to indicate the messages are gone by then and to
>> > > > > > explicitly show that messages are deleted, so the number of
>> > > > > > messages in the topic is capped.
>> > > > > >
>> > > > > > Regarding whether the topic __remote_log_metadata is still
>> > > > > > necessary, I am inclined to keep this topic at least for debugging
>> > > > > > purposes, so we can build confidence in the compacted topic change.
>> > > > > > We can always choose to remove this topic in the future once we all
>> > > > > > agree it provides limited value to users.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Lijun Tong
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Jan 5, 2026 at 16:19, Henry Haiying Cai via dev
>> > > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > > Lijun,
>> > > > > > >
>> > > > > > > Thanks for the proposal; I like your idea of using a compacted
>> > > > > > > topic for the tiered storage metadata topic.
>> > > > > > >
>> > > > > > > In our setup, we have set a shorter retention (3 days) for the
>> > > > > > > tiered storage metadata topic to control its size growth. We can
>> > > > > > > do that since we control every topic's retention policy in our
>> > > > > > > clusters and set a uniform retention policy for all our tiered
>> > > > > > > storage topics. I can see that other users/companies will not be
>> > > > > > > able to enforce such a retention policy on all tiered storage
>> > > > > > > topics.
>> > > > > > >
>> > > > > > > Some suggestions: In your example scenarios, it would also be good
>> > > > > > > to add an example of remote log segment deletion triggered by the
>> > > > > > > retention policy, which will generate a tombstone event in the
>> > > > > > > metadata topic and trigger log compaction/deletion 24 hours later;
>> > > > > > > I think this is the key event for capping the metadata topic size.
>> > > > > > >
>> > > > > > > For the original unbounded remote_log_metadata topic, I am not
>> > > > > > > sure whether we still need it or not. If it is left only for
>> > > > > > > audit-trail purposes, people can set up a data ingestion pipeline
>> > > > > > > to ingest the content of the metadata topic into a separate
>> > > > > > > storage location. I think we can have a flag to have only one
>> > > > > > > metadata topic (the compacted version).
>> > > > > > >
>> > > > > > >
>> > > > > > > On Monday, January 5, 2026 at 01:22:42 PM PST, Lijun Tong <[email protected]> wrote:
>> > > > > > >
>> > > > > > > Hello Kafka Community,
>> > > > > > >
>> > > > > > > I would like to start a discussion on KIP-1266, which proposes
>> > > > > > > adding a new compacted remote log metadata topic for tiered
>> > > > > > > storage, to limit the number of messages that need to be iterated
>> > > > > > > over to build the remote metadata state.
>> > > > > > >
>> > > > > > > KIP link: KIP-1266 Bounding The Number Of RemoteLogMetadata
>> > > > > > > Messages via Compacted RemoteLogMetadata Topic
>> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1266%3A+Bounding+The+Number+Of+RemoteLogMetadata+Messages+via+Compacted+Topic>
>> > > > > > >
>> > > > > > > Background:
>> > > > > > > The current Tiered Storage implementation uses a
>> > > > > > > __remote_log_metadata topic with infinite retention and a
>> > > > > > > delete-based cleanup policy. This causes unbounded growth, slow
>> > > > > > > broker bootstrap, and inefficient re-reading from offset 0 during
>> > > > > > > leadership changes, with no mechanism to clean up expired segment
>> > > > > > > metadata.
>> > > > > > >
>> > > > > > > Proposal:
>> > > > > > > A dual-topic approach that introduces a new
>> > > > > > > __remote_log_metadata_compacted topic using log compaction with
>> > > > > > > deterministic offset-based keys, while preserving the existing
>> > > > > > > topic for audit history. This allows brokers to build their
>> > > > > > > metadata cache exclusively from the compacted topic, enables
>> > > > > > > cleanup of expired segment metadata through tombstones, and
>> > > > > > > includes a migration strategy to populate the new topic during the
>> > > > > > > upgrade, delivering bounded metadata growth and faster broker
>> > > > > > > startup while maintaining backward compatibility.
>> > > > > > >
>> > > > > > > More details are in the KIP linked above.
>> > > > > > > Looking forward to your thoughts.
>> > > > > > >
>> > > > > > > Thank you for your time!
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Lijun Tong
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
