Hi all, I was crash testing a Kafka Streams 4.1.2 app (exactly_once_v2, default state updater, persistent RocksDB stores, commit.interval.ms=500, running on Kubernetes StatefulSets with persistent volumes) and found behavior that doesn't match my understanding of the EOS state store model. Wanted a sanity check from people who know the internals before filing a JIRA. Also my JIRA account request is pending on selfserve (username: zoro3010200, email: [email protected]), if a PMC member can approve it that would help a lot.
What I observed (reproduced twice): 1. After restoration completes, a .checkpoint file exists in the task directory and stays there unchanged for the whole RUNNING session. Frozen restore-time offsets, mtime never moves even while processing around 16k rec/s. 2. SIGKILL during active processing, zero grace period. On restart the instance finds that stale checkpoint, logs "State store X initialized from checkpoint with offset ...", no TaskCorruptedException, no wipe, and just replays the changelog tail. 3. The wipe path only fired in a different test where the crash happened during restoration itself, before any checkpoint entry existed. Code paths I traced on the 4.1 branch: - DefaultStateUpdater.maybeCompleteRestoration calls task.maybeCheckpoint(true) on restore completion with no EOS condition. - Nothing deletes that file on the restored to RUNNING transition. StreamTask.completeRestoration writes its own checkpoint only if !eosEnabled, which makes me think the intent was that no checkpoint should exist past that point under EOS. - During RUNNING under EOS, postCommit only checkpoints when enforceCheckpoint=true, so the file never gets refreshed or removed in steady state. - The only delete sites I could find: init time (ProcessorStateManager.initializeStoreOffsetsFromCheckpoint), resume from SUSPENDED (KAFKA-10362), and removeCheckpointForCorruptedTask. Why I think this matters: RocksDBStore disables the WAL and Streams doesn't flush on commit under EOS, so a background memtable flush landing mid transaction can persist writes from a transaction that later gets aborted. The tail replay runs read_committed and skips those aborted records, so it can't clean them up. Reprocessing of the uncommitted input offsets papers over it for deterministic input driven writes, but state written by wall clock punctuators never gets regenerated, and read-before-write logic (dedup on an ID for example) can read the leftover value and silently skip a redelivered record. The KIP-892 motivation section says EOS has to wipe task state on crash because data gets written to the store before the changelog commit completes, and this behavior bypasses that wipe. KAFKA-10362 fixed what looks like the same class of issue on the resume path with an explicit deleteCheckPointFileIfEOSEnabled(). The restored to RUNNING transition seems to be missing the same delete ever since the state updater became the default. So my questions: is this known? Intended? Is there some safety argument I'm missing? If it's a real gap I'm happy to file the JIRA with full logs and disk evidence from the crash tests, and help test a fix. Thanks,
