Hi all,

I was crash testing a Kafka Streams 4.1.2 app (exactly_once_v2, default
state updater, persistent RocksDB stores, commit.interval.ms=500, running
on Kubernetes StatefulSets with persistent volumes) and found behavior that
doesn't match my understanding of the EOS state store model. Wanted a
sanity check from people who know the internals before filing a JIRA. Also
my JIRA account request is pending on selfserve (username: zoro3010200,
email: [email protected]), if a PMC member can approve it that would
help a lot.

What I observed (reproduced twice):

1. After restoration completes, a .checkpoint file exists in the task
directory and stays there unchanged for the whole RUNNING session. Frozen
restore-time offsets, mtime never moves even while processing around 16k
rec/s.
2. SIGKILL during active processing, zero grace period. On restart the
instance finds that stale checkpoint, logs "State store X initialized from
checkpoint with offset ...", no TaskCorruptedException, no wipe, and just
replays the changelog tail.
3. The wipe path only fired in a different test where the crash happened
during restoration itself, before any checkpoint entry existed.

Code paths I traced on the 4.1 branch:

- DefaultStateUpdater.maybeCompleteRestoration calls
task.maybeCheckpoint(true) on restore completion with no EOS condition.
- Nothing deletes that file on the restored to RUNNING transition.
StreamTask.completeRestoration writes its own checkpoint only if
!eosEnabled, which makes me think the intent was that no checkpoint should
exist past that point under EOS.
- During RUNNING under EOS, postCommit only checkpoints when
enforceCheckpoint=true, so the file never gets refreshed or removed in
steady state.
- The only delete sites I could find: init time
(ProcessorStateManager.initializeStoreOffsetsFromCheckpoint), resume from
SUSPENDED (KAFKA-10362), and removeCheckpointForCorruptedTask.

Why I think this matters: RocksDBStore disables the WAL and Streams doesn't
flush on commit under EOS, so a background memtable flush landing mid
transaction can persist writes from a transaction that later gets aborted.
The tail replay runs read_committed and skips those aborted records, so it
can't clean them up. Reprocessing of the uncommitted input offsets papers
over it for deterministic input driven writes, but state written by wall
clock punctuators never gets regenerated, and read-before-write logic
(dedup on an ID for example) can read the leftover value and silently skip
a redelivered record. The KIP-892 motivation section says EOS has to wipe
task state on crash because data gets written to the store before the
changelog commit completes, and this behavior bypasses that wipe.

KAFKA-10362 fixed what looks like the same class of issue on the resume
path with an explicit deleteCheckPointFileIfEOSEnabled(). The restored to
RUNNING transition seems to be missing the same delete ever since the state
updater became the default.

So my questions: is this known? Intended? Is there some safety argument I'm
missing? If it's a real gap I'm happy to file the JIRA with full logs and
disk evidence from the crash tests, and help test a fix.

Thanks,

Reply via email to