Yes, I agree with that, and I have changed the KIP accordingly.

One small detail I still want to fix is the naming of the new property. The proposal was state.cleanup.on.start.delay.ms, but the value is not really a 'delay'; it is more of a 'max age' for a state directory. I therefore propose a slightly different name: state.cleanup.dir.max.age.ms. From my perspective, this name better describes the new functionality while still showing its relation to 'state.cleanup.delay.ms'.

The KIP is updated.
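To make the relation between the two configs concrete, here is a rough sketch of how they could sit side by side in an application. This is illustration only: state.cleanup.dir.max.age.ms is just the name proposed above and not an existing Kafka Streams config, and the 20-hour value is a hypothetical choice for a changelog whose delete.retention.ms is 24 hours.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class CleanupOnStartExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Existing config: how often the background cleaner removes idle task directories.
        props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, Duration.ofMinutes(10).toMillis());

        // Proposed config (name from this thread, not a released config): on startup, wipe
        // state directories older than this age; the proposed default of -1 keeps it disabled.
        // With delete.retention.ms = 24h on the changelogs, 20h stays below the point where
        // tombstones may already have been purged.
        props.put("state.cleanup.dir.max.age.ms", Duration.ofHours(20).toMillis());
    }
}

The point is simply that the max-age threshold can be tuned against delete.retention.ms independently of the regular background cleanup interval.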
On Fri, Feb 13, 2026 at 10:22 PM Matthias J. Sax <[email protected]> wrote:

> One more thing came to mind. We should also define the "importance level" of the new config.
>
> Given that it's kind of an edge case, maybe "low" would be the right pick? Thoughts?
>
> The KIP should include this information.
>
> -Matthias
>
> On 2/13/26 11:56 AM, Matthias J. Sax wrote:
> > Thanks for updating the KIP. LGTM
> >
> > I don't have any further comments. While we wait for others to chime in, too, I think you can start a vote in parallel.
> >
> > -Matthias
> >
> > On 1/28/26 4:42 AM, Uladzislau Blok wrote:
> >> Hello Matthias,
> >>
> >> Thank you for the feedback.
> >>
> >> I really like the proposal to change state.cleanup.on.start from a boolean to a long (with a default of -1). Do we need to change naming then?
> >> Proposal: state.cleanup.on.start.delay.ms
> >>
> >> Decoupling this from state.cleanup.delay.ms ensures the new feature doesn't have unintended side effects. It also gives users the flexibility to align the cleanup threshold with their delete.retention.ms settings. For example, if the retention is set to 24 hours, a user could safely set the cleanup property to 20 hours (or even closer to the retention value).
> >>
> >> Regarding the global store case, I believe this approach helps there as well. Even if a less-frequently updated global store is wiped, it would only occur according to the specific threshold the user has defined, which is a manageable trade-off.
> >>
> >> I have updated the KIP accordingly.
> >>
> >> Best regards,
> >> Uladzislau Blok
> >>
> >> On Tue, Jan 27, 2026 at 8:19 AM Matthias J. Sax <[email protected]> wrote:
> >>
> >>> Thanks for raising both points.
> >>>
> >>> The global store one is tricky. Not sure atm. The good thing is of course that this new feature is disabled by default. Maybe it would be sufficient to call out this edge case in the docs explicitly, calling for caution, but leave it up to the user to decide? -- Maybe others have some ideas?
> >>>
> >>> About increasing `state.cleanup.delay.ms` -- I am not convinced it would be a good idea. I would propose two alternatives.
> >>>
> >>> - extend the doc to tell users to consider increasing this config, if they use this new feature
> >>>
> >>> - change `state.cleanup.on.start` from a boolean to a long, with default value `-1` (for disabled) and let users decide what age threshold they want to apply when enabling the feature, effectively decoupling the new feature from the `state.cleanup.delay.ms` config.
> >>>
> >>> Thoughts?
> >>>
> >>> -Matthias
> >>>
> >>> On 1/18/26 11:01 AM, Uladzislau Blok wrote:
> >>>> Hello Matthias,
> >>>>
> >>>> Thanks for the feedback on the KIP.
> >>>>
> >>>> It seems we had a slight misunderstanding regarding the cleanup logic, but after revisiting the ticket and the existing codebase, your suggestion to wipe stores older than state.cleanup.delay.ms makes perfect sense. I have updated the KIP accordingly, and it is now ready for a second round of review.
> >>>>
> >>>> I would like to highlight two specific points for further discussion:
> >>>>
> >>>> - This proposal might cause global stores to be deleted if they aren't updated often. Currently, we check the last modification time of the directory. If a global table hasn't changed, it might be cleaned up even if the data is still valid. However, since these tables are usually small, this might not be a major issue. What do you think?
> >>>>
> >>>> - We previously discussed increasing the default value for state.cleanup.delay.ms to be less aggressive. Do we have any consensus on a reasonable default, or a recommended methodology for measuring what this value should be?
> >>>>
> >>>> Regards,
> >>>> Uladzislau Blok.
> >>>>
> >>>> On Mon, Jan 12, 2026 at 2:55 AM Matthias J. Sax <[email protected]> wrote:
> >>>>
> >>>>> Thanks for the KIP Uladzislau.
> >>>>>
> >>>>> Given that you propose to wipe the entire state if this config is set, I am wondering if we would need such a config to begin with, or if users could implement this themselves (via some custom config the application code uses) and call `KafkaStreams#cleanUp()` to wipe out all local state if this custom config is set?
> >>>>>
> >>>>> I believe I remember from the original ticket discussion that the idea was not to blindly wipe the entire state, but to do it still based on task directory age, similar to what the background cleaner thread does (based on the `state.cleanup.delay.ms` config), and to trigger a cleanup run before startup. Thoughts?
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>> On 12/21/25 6:37 AM, Uladzislau Blok wrote:
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I'd like to start a discussion on *KIP-1259: Add configuration to wipe local state on startup*.
> >>>>>>
> >>>>>> Problem
> >>>>>>
> >>>>>> Currently, Kafka Streams can encounter a "zombie data" issue when an instance restarts using stale local files after a period exceeding the changelog topic's delete.retention.ms. If the local checkpoint offset is still within the broker's available log range (due to long-lived entities), an automatic reset isn't triggered. However, since the broker has already purged deletion tombstones, the state store is rehydrated without the "delete" instructions, causing previously deleted entities to unexpectedly reappear in the local RocksDB.
> >>>>>>
> >>>>>> Proposed Solution
> >>>>>>
> >>>>>> I propose introducing a new configuration, state.cleanup.on.start (Boolean, default: false). When enabled, this property forces the deletion of all local state directories and checkpoint files during application initialization. This ensures the state is rebuilt entirely from the changelog—the broker's "source of truth"—effectively purging any expired zombie records.
> >>>>>>
> >>>>>> This is particularly useful for environments with persistent volumes where instances might remain dormant for long periods (e.g., multi-region failover).
> >>>>>>
> >>>>>> *KIP Link:*
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1259%3A+Add+configuration+to+wipe+Kafka+Streams+local+state+on+startup
> >>>>>>
> >>>>>> I look forward to your feedback and suggestions.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Uladzislau Blok
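(For reference, the application-level alternative mentioned earlier in the thread, wiping state yourself via KafkaStreams#cleanUp() before start(), could look roughly like the sketch below. The WIPE_STATE_ON_START switch and the topic names are purely hypothetical and not part of Kafka Streams; cleanUp() itself is the existing public API.)

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ManualWipeExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders").to("orders-copy");

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Hypothetical application-level switch; not a Kafka Streams config.
        final boolean wipeOnStart =
            Boolean.parseBoolean(System.getenv().getOrDefault("WIPE_STATE_ON_START", "false"));
        if (wipeOnStart) {
            // cleanUp() must be called while the instance is not running; it deletes the local
            // state directory for this application.id so state is rebuilt from the changelogs.
            streams.cleanUp();
        }
        streams.start();
    }
}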
