Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

Chia-Ping Tsai Wed, 22 Apr 2026 12:59:36 -0700

Hi Jun,

Thanks for the feedback. I agree that shifting this policy toward a "Smarter 
Latest" (rather than a better Earliest) is a more elegant path.


The refined behavior would be:

Out-of-range: Strictly follow latest semantics. This ensures a predictable 
"skip to end" behavior when users fall behind retention.

No-offset (Initial Start & Expansion): Use the group creation time as the timestamp for the offset lookup.

• For new groups, this naturally results in latest behavior since creation time 
is "now".

• For existing groups discovering new partitions, this results in earliest 
behavior for those specific partitions.

Group GC: If a group is purged, it is treated as a brand-new group with a 
creation time of "now," consistently skipping to the end.
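
To make the three cases concrete, here is a minimal sketch of the decision logic. It is a toy model, not the KIP's implementation: `PartitionLog`, `offset_for_time`, and `resolve_reset_offset` are hypothetical names, the timestamp lookup only imitates "first offset whose timestamp is at or after the target", and falling back to the log end when the lookup finds nothing is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ResetReason(Enum):
    OUT_OF_RANGE = "out_of_range"  # committed offset fell behind retention
    NO_OFFSET = "no_offset"        # no committed offset for this partition


@dataclass(frozen=True)
class PartitionLog:
    # Hypothetical stand-in for a partition: (timestamp_ms, offset) pairs in offset order.
    records: Tuple[Tuple[int, int], ...]
    log_end_offset: int


def offset_for_time(log: PartitionLog, timestamp_ms: int) -> Optional[int]:
    """Return the first offset whose record timestamp is >= timestamp_ms, else None."""
    for record_ts, offset in log.records:
        if record_ts >= timestamp_ms:
            return offset
    return None


def resolve_reset_offset(reason: ResetReason, log: PartitionLog,
                         group_creation_time_ms: int) -> int:
    # Out-of-range: strictly latest semantics, i.e. skip to the end.
    if reason is ResetReason.OUT_OF_RANGE:
        return log.log_end_offset
    # No-offset: look up by group creation time.
    found = offset_for_time(log, group_creation_time_ms)
    # Assumption: if no record is at or after the creation time, start at the end.
    return found if found is not None else log.log_end_offset


existing = PartitionLog(records=((100, 0), (200, 1), (300, 2)), log_end_offset=3)
added = PartitionLog(records=((1500, 0), (1600, 1)), log_end_offset=2)

# New group, or a purged group recreated "now" (t=2000): behaves like latest.
print(resolve_reset_offset(ResetReason.NO_OFFSET, existing, 2000))     # 3
# Existing group (created at t=1000) discovering a partition added later: earliest.
print(resolve_reset_offset(ResetReason.NO_OFFSET, added, 1000))        # 0
# Out-of-range on any partition: skip to the end.
print(resolve_reset_offset(ResetReason.OUT_OF_RANGE, existing, 1000))  # 3
```

Under the same rule, an old partition whose surviving records are all newer than the group creation time would also resolve to its earliest retained offset, which is the truncated-partition edge case mentioned earlier in this thread.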

WDYT?


> On Apr 23, 2026, at 1:34 AM, Jun Rao via dev <[email protected]> wrote:
> 
> Hi, Chia-Ping,
> 
> Thanks for the reply.
> 
> Let's try to understand from the user's perspective. When the user starts
> the group for the first time, it faces a choice on whether to process the
> backlog or not. When the offset is out-of-range, the user faces the same
> choice regarding backlog processing. It seems that most users want to make
> the same choice regarding backlog processing.
> 
> "Users who explicitly choose the to_start_time policy do so precisely
> because they do not want to skip any records when encountering an
> out-of-range scenario."
> This argument is weak because that's how to_start_time is designed, but we
> need to justify why it is a good choice in the first place.
> 
> Jun
> 
>> On Tue, Apr 21, 2026 at 12:35 PM Chia-Ping Tsai <[email protected]> wrote:
>> 
>> Hi Jun,
>> 
>> Thanks for the clarification. I think I misunderstood your previous point.
>> Let me summarize the scenarios to ensure we are fully aligned.
>> 
>> There are essentially three scenarios when a consumer needs to reset
>> offsets:
>> 
>>   1. Out-of-range (The group exists, but the offset has expired).
>>   2. Extended partition (The group exists, but encounters a newly added
>>      partition with no committed offset).
>>   3. No-offset (The group is completely new, or an existing group was
>>      deleted by the GC).
>> 
>> We all agree that the primary goal of this KIP is to catch up on all
>> records for scenario 2. There are no objections here.
>> 
>> Regarding the inconsistency you pointed out between 1) and 3) under the
>> current to_start_time design, I completely see your point. If users are
>> not fully aware that to_start_time is designed to read all records since
>> the creation of the group, they might get confused.
>> 
>> However, to me, this "inconsistency" is actually a matter of
>> predictability. Users who explicitly choose the to_start_time policy do
>> so precisely because they do not want to skip any records when encountering
>> an out-of-range scenario.
>> 
>> (I would prefer to set aside the topic of group GC for a moment. It is
>> much more important that we first focus our discussion on the
>> "out-of-range" scenario)
>> 
>> Best,
>> 
>> Chia-Ping
>> 
>> On Wed, Apr 22, 2026, at 1:13 AM, Jun Rao via dev <[email protected]> wrote:
>> 
>>> Hi, Chia-Ping,
>>> 
>>> Hmm, is that true? With the earliest policy, we treat an out-of-range
>>> offset the same as no offset (because the group is deleted) and always set
>>> it to the earliest offset, right? With to_start_time, an out-of-range
>>> offset is treated differently from no offset.
>>> 
>>> Thanks,
>>> 
>>> Jun
>>> 
>>> On Tue, Apr 21, 2026 at 12:54 AM Chia-Ping Tsai <[email protected]>
>>> wrote:
>>> 
>>>> hi Jun
>>>> 
>>>> Nice point. Group GC is definitely an issue for to_start_time, but it is
>>>> actually an issue for other policies as well.
>>>> 
>>>> For example, a consumer using the earliest policy will suddenly read all
>>>> historical records from scratch if it sleeps for a long while and gets
>>>> GC'd; otherwise, it just resumes from previous offsets if the group still
>>>> exists. It is equally hard to explain to users: "Oh, your group was GC'd,
>>>> so your offset behavior changed."
>>>> 
>>>> Therefore, it seems to me the right approach to fix this "inconsistency"
>>>> is to offer a group-level GC timeout in a future KIP, allowing users to
>>>> explicitly protect critical groups from GC. This saves not only
>>>> to_start_time, but all other reset policies too.
>>>> 
>>>> Best,
>>>> Chia-Ping
>>>> 
>>>> On 2026/04/20 20:19:47 Jun Rao via dev wrote:
>>>>> Hi, Jiunn-Yang and Chia-Ping,
>>>>> 
>>>>> Thanks for the reply.
>>>>> 
>>>>> The main concern I see with to_start_time is that its behavior on how much
>>>>> data to consume when the offset is out of range is not consistent and is
>>>>> hard to explain. If the group still exists, it will read from the earliest
>>>>> offset. Otherwise, it will read from the latest.
>>>>> 
>>>>> Jun
>>>>> 
>>>>> On Mon, Apr 20, 2026 at 10:13 AM Chia-Ping Tsai <[email protected]> wrote:
>>>>> 
>>>>>> hi all,
>>>>>> 
>>>>>> Just a note for a potential latest_v2:
>>>>>> 
>>>>>> Since the purpose is to read all records from extended partitions, we
>>>>>> could leverage the group creation time to compare against the earliest
>>>>>> record of a partition when there is no committed offset. If the group
>>>>>> creation time is later than the earliest record's timestamp, we assume it
>>>>>> is not an extended partition. Otherwise, we treat it as an extended
>>>>>> partition.
>>>>>> 
>>>>>> This approach allows us to catch all "possible" extended partitions, which
>>>>>> includes both "true" extended partitions and old but truncated partitions.
>>>>>> While there is a rare edge case where the cost is reprocessing some records
>>>>>> we don't necessarily want, it is very easy to implement and guarantees we
>>>>>> will never miss the actual extended partitions.
>>>>>> 
>>>>>> Best,
>>>>>> Chia-Ping
>>>>>> 
>>>>>> On 2026/04/20 13:33:31 黃竣陽 wrote:
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> I have added a new "Future Work: latest_strict Policy" section to the KIP.
>>>>>>> The idea is a future policy that uses latest semantics by default but falls
>>>>>>> back to the group creation timestamp specifically for newly added partitions
>>>>>>> during partition expansion. This would reuse the group creation time anchor
>>>>>>> introduced by this KIP, making it a natural extension with minimal additional
>>>>>>> protocol changes.
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Jiunn-Yang
>>>>>>> 
>>>>>>>> On Apr 18, 2026, at 4:09 PM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> It is practically NP-hard to guess everyone's ideal use case right now.
>>>>>>>> Also, I believe we all want to avoid falling back to the intricate
>>>>>>>> multi-policy approach proposed in KIP-842.
>>>>>>>> 
>>>>>>>> I prefer to keep this KIP focused and discuss a "v2 latest" policy in a
>>>>>>>> separate KIP. That future policy could build upon the to_start_time anchor
>>>>>>>> to fix data loss specifically for extended partitions. We could call it
>>>>>>>> something like latest_strict.
>>>>>>>> 
>>>>>>>> Thoughts?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Apr 18, 2026, at 3:24 PM, 黃竣陽 <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hello Jun,
>>>>>>>>> 
>>>>>>>>> Thanks for the reply.
>>>>>>>>> 
>>>>>>>>> When the offset goes out of range, the user faces two options:
>>>>>>>>> 
>>>>>>>>> 1. Skip to the end (latest behavior): risk losing data that was produced
>>>>>>>>> during the group's lifetime but not yet consumed.
>>>>>>>>> 2. Seek back to the group creation time (to_start_time behavior):
>>>>>>>>> potentially reprocess some data, but guarantee no data from the group's
>>>>>>>>> lifetime is silently lost.
>>>>>>>>> 
>>>>>>>>> to_start_time chooses option 2 because its core promise is "never silently
>>>>>>>>> lose data produced after the group started." If we fell back to latest on
>>>>>>>>> out-of-range, we would break this guarantee.
>>>>>>>>> 
>>>>>>>>> I think users who prefer option 1 can simply use
>>>>>>>>> auto.offset.reset=latest.
>>>>>>>>> 
>>>>>>>>> Best Regards,
>>>>>>>>> Jiunn-Yang
>>>>>>>>> 
>>>>>>>>>> On Apr 18, 2026, at 1:57 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi, Jiunn-Yang and Chia-Ping,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the reply.
>>>>>>>>>> 
>>>>>>>>>> "The core semantic of to_start_time is to read all records since the
>>>>>>>>>> creation of the group."
>>>>>>>>>> 
>>>>>>>>>> I am just questioning whether this actually covers a common use case. If
>>>>>>>>>> the offset doesn't go out of range, the logic makes sense to me. I'm not
>>>>>>>>>> sure about the logic if the offset is out of range. If a user chooses to
>>>>>>>>>> skip the historical data when starting the group, it seems the user likely
>>>>>>>>>> wants to do the same if the offset is out of range.
>>>>>>>>>> 
>>>>>>>>>> Jun
>>>>>>>>>> 
>>>>>>>>>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello Jun,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the feedback.
>>>>>>>>>>> 
>>>>>>>>>>> Adding to the points above:
>>>>>>>>>>> 
>>>>>>>>>>> Regarding by_duration as an alternative to Scenario 1: beyond clock skew
>>>>>>>>>>> and retry issues, there is also a usability concern. by_duration requires
>>>>>>>>>>> users to reason about operational timing ("how long does partition
>>>>>>>>>>> discovery take in my environment?") and then translate that into a
>>>>>>>>>>> configuration value. to_start_time requires no such reasoning. It simply
>>>>>>>>>>> anchors to the group creation time recorded by the broker.
>>>>>>>>>>> 
>>>>>>>>>>> Regarding Scenario 2: I'd also like to clarify that to_start_time does not
>>>>>>>>>>> branch between "use latest" and "use earliest." It applies the same
>>>>>>>>>>> ListOffsetsRequest with the group creation timestamp in all cases. The
>>>>>>>>>>> difference in outcome:
>>>>>>>>>>> - skipping old data on first start
>>>>>>>>>>> - consuming surviving data after truncation
>>>>>>>>>>> is a natural consequence of what data exists in the partition at that
>>>>>>>>>>> point, not a different policy being applied. The rule is always the same.
>>>>>>>>>>> 
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 17, 2026, at 9:48 AM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 17, 2026, at 4:57 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Also, a group is deleted after the consumer has been idle longer
>>>>>>>>>>>>> than offsets.retention.minutes. What's the semantic of to_start_time if
>>>>>>>>>>>>> the group creation time is unavailable?
>>>>>>>>>>>> 
>>>>>>>>>>>> If the group is recreated, a new creation time will be recorded. Hence,
>>>>>>>>>>>> it acts like a new group. Plus, it throws an exception directly if the
>>>>>>>>>>>> group truly has no creation time.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>
