B. Micheal Okutubo created SPARK-54121:
------------------------------------------
Summary: Automatic Snapshot Repair for State store
Key: SPARK-54121
URL: https://issues.apache.org/jira/browse/SPARK-54121
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.1.0
Reporter: B. Micheal Okutubo
Today, the engine currently treats both the changelog and snapshot files with
the same importance, so when a state store needs to be loaded it reads the
latest snapshot and applies the subsequent changes on it. If the snapshot is
bad or corrupted, then the query will fail and be completely down and blocked,
needing manual intervention. This leads to user clearing their query checkpoint
and having to do full recomputation.
This shouldn’t be the case. The changelog should be treated as the “source of
truth” and the snapshot is just a disposable materialization of the log.
Introducing Automatic snapshot repair, which will automatically repair the
checkpoint by skipping bad snapshots and rebuilding the current state from the
last good snapshot (works even if there’s none) and applying the changelogs on
it. This eliminates the need for manual intervention and unblocks the pipeline
to keep it running.
Also emit metrics about number of state stores that were auto repaired in a
given batch, so that you can build alert and dashboard for it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]