Yes, I agree with that, and I have changed the KIP accordingly.

One small detail I still want to fix is the naming of the new property. The proposal was state.cleanup.on.start.delay.ms, but the value is not really a 'delay'; it is more of a 'max age' for a state directory. I therefore propose a slightly different name: state.cleanup.dir.max.age.ms. From my perspective, this name better describes the new functionality while still showing its relation to 'state.cleanup.delay.ms'.

The KIP is updated.
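To make the relation between the two configs concrete, here is a rough sketch of how they could sit side by side in an application. This is illustration only: state.cleanup.dir.max.age.ms is just the name proposed above and not an existing Kafka Streams config, and the 20-hour value is a hypothetical choice for a changelog whose delete.retention.ms is 24 hours.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class CleanupOnStartExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Existing config: how often the background cleaner removes idle task directories.
        props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, Duration.ofMinutes(10).toMillis());

        // Proposed config (name from this thread, not a released config): on startup, wipe
        // state directories older than this age; the proposed default of -1 keeps it disabled.
        // With delete.retention.ms = 24h on the changelogs, 20h stays below the point where
        // tombstones may already have been purged.
        props.put("state.cleanup.dir.max.age.ms", Duration.ofHours(20).toMillis());
    }
}

The point is simply that the max-age threshold can be tuned against delete.retention.ms independently of the regular background cleanup interval.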
On Fri, Feb 13, 2026 at 10:22 PM Matthias J. Sax <[email protected]> wrote:

> One more thing came to mind. We should also define the "importance level" of the new config.
>
> Given that it's kind of an edge case, maybe "low" would be the right pick? Thoughts?
>
> The KIP should include this information.
>
> -Matthias
>
> On 2/13/26 11:56 AM, Matthias J. Sax wrote:
> > Thanks for updating the KIP. LGTM
> >
> > I don't have any further comments. While we wait for others to chime in, too, I think you can start a vote in parallel.
> >
> > -Matthias
> >
> > On 1/28/26 4:42 AM, Uladzislau Blok wrote:
> >> Hello Matthias,
> >>
> >> Thank you for the feedback.
> >>
> >> I really like the proposal to change state.cleanup.on.start from a boolean to a long (with a default of -1). Do we need to change naming then?
> >> Proposal: state.cleanup.on.start.delay.ms
> >>
> >> Decoupling this from state.cleanup.delay.ms ensures the new feature doesn't have unintended side effects. It also gives users the flexibility to align the cleanup threshold with their delete.retention.ms settings. For example, if the retention is set to 24 hours, a user could safely set the cleanup property to 20 hours (or even closer to the retention value).
> >>
> >> Regarding the global store case, I believe this approach helps there as well. Even if a less-frequently updated global store is wiped, it would only occur according to the specific threshold the user has defined, which is a manageable trade-off.
> >>
> >> I have updated the KIP accordingly.
> >>
> >> Best regards,
> >> Uladzislau Blok
> >>
> >> On Tue, Jan 27, 2026 at 8:19 AM Matthias J. Sax <[email protected]> wrote:
> >>
> >>> Thanks for raising both points.
> >>>
> >>> The global store one is tricky. Not sure atm. The good thing is of course that this new feature is disabled by default. Maybe it would be sufficient to call out this edge case in the docs explicitly, calling for caution, but leave it up to the user to decide? -- Maybe others have some ideas?
> >>>
> >>> About increasing `state.cleanup.delay.ms` -- I am not convinced it would be a good idea. I would propose two alternatives.
> >>>
> >>> - extend the doc to tell users to consider increasing this config, if they use this new feature
> >>>
> >>> - change `state.cleanup.on.start` from a boolean to a long, with default value `-1` (for disabled) and let users decide what age threshold they want to apply when enabling the feature, effectively decoupling the new feature from the `state.cleanup.delay.ms` config.
> >>>
> >>> Thoughts?
> >>>
> >>> -Matthias
> >>>
> >>> On 1/18/26 11:01 AM, Uladzislau Blok wrote:
> >>>> Hello Matthias,
> >>>>
> >>>> Thanks for the feedback on the KIP.
> >>>>
> >>>> It seems we had a slight misunderstanding regarding the cleanup logic, but after revisiting the ticket and the existing codebase, your suggestion to wipe stores older than state.cleanup.delay.ms makes perfect sense. I have updated the KIP accordingly, and it is now ready for a second round of review.
> >>>>
> >>>> I would like to highlight two specific points for further discussion:
> >>>>
> >>>> - This proposal might cause global stores to be deleted if they aren't updated often. Currently, we check the last modification time of the directory. If a global table hasn't changed, it might be cleaned up even if the data is still valid. However, since these tables are usually small, this might not be a major issue. What do you think?
> >>>>
> >>>> - We previously discussed increasing the default value for state.cleanup.delay.ms to be less aggressive. Do we have any consensus on a reasonable default, or a recommended methodology for measuring what this value should be?
> >>>>
> >>>> Regards,
> >>>> Uladzislau Blok.
> >>>>
> >>>> On Mon, Jan 12, 2026 at 2:55 AM Matthias J. Sax <[email protected]> wrote:
> >>>>
> >>>>> Thanks for the KIP Uladzislau.
> >>>>>
> >>>>> Given that you propose to wipe the entire state if this config is set, I am wondering if we would need such a config to begin with, or if users could implement this themselves (via some custom config the application code uses) and call `KafkaStreams#cleanUp()` to wipe out all local state if this custom config is set?
> >>>>>
> >>>>> I believe I remember from the original ticket discussion that the idea was not to blindly wipe the entire state, but to do it still based on task directory age, similar to what the background cleaner thread does (based on the `state.cleanup.delay.ms` config), and to trigger a cleanup run before startup. Thoughts?
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>> On 12/21/25 6:37 AM, Uladzislau Blok wrote:
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I'd like to start a discussion on *KIP-1259: Add configuration to wipe local state on startup*.
> >>>>>>
> >>>>>> Problem
> >>>>>>
> >>>>>> Currently, Kafka Streams can encounter a "zombie data" issue when an instance restarts using stale local files after a period exceeding the changelog topic's delete.retention.ms. If the local checkpoint offset is still within the broker's available log range (due to long-lived entities), an automatic reset isn't triggered. However, since the broker has already purged deletion tombstones, the state store is rehydrated without the "delete" instructions, causing previously deleted entities to unexpectedly reappear in the local RocksDB.
> >>>>>>
> >>>>>> Proposed Solution
> >>>>>>
> >>>>>> I propose introducing a new configuration, state.cleanup.on.start (Boolean, default: false). When enabled, this property forces the deletion of all local state directories and checkpoint files during application initialization. This ensures the state is rebuilt entirely from the changelog—the broker's "source of truth"—effectively purging any expired zombie records.
> >>>>>>
> >>>>>> This is particularly useful for environments with persistent volumes where instances might remain dormant for long periods (e.g., multi-region failover).
> >>>>>>
> >>>>>> *KIP Link:*
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1259%3A+Add+configuration+to+wipe+Kafka+Streams+local+state+on+startup
> >>>>>>
> >>>>>> I look forward to your feedback and suggestions.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Uladzislau Blok
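(For reference, the application-level alternative mentioned earlier in the thread, wiping state yourself via KafkaStreams#cleanUp() before start(), could look roughly like the sketch below. The WIPE_STATE_ON_START switch and the topic names are purely hypothetical and not part of Kafka Streams; cleanUp() itself is the existing public API.)

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ManualWipeExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders").to("orders-copy");

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Hypothetical application-level switch; not a Kafka Streams config.
        final boolean wipeOnStart =
            Boolean.parseBoolean(System.getenv().getOrDefault("WIPE_STATE_ON_START", "false"));
        if (wipeOnStart) {
            // cleanUp() must be called while the instance is not running; it deletes the local
            // state directory for this application.id so state is rebuilt from the changelogs.
            streams.cleanUp();
        }
        streams.start();
    }
}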
