B. Micheal Okutubo created SPARK-54121:
------------------------------------------

             Summary: Automatic Snapshot Repair for State store
                 Key: SPARK-54121
                 URL: https://issues.apache.org/jira/browse/SPARK-54121
             Project: Spark
          Issue Type: Task
          Components: Structured Streaming
    Affects Versions: 4.1.0
            Reporter: B. Micheal Okutubo


Today, the engine currently treats both the changelog and snapshot files with 
the same importance, so when a state store needs to be loaded it reads the 
latest snapshot and applies the subsequent changes on it. If the snapshot is 
bad or corrupted, then the query will fail and be completely down and blocked, 
needing manual intervention. This leads to user clearing their query checkpoint 
and having to do full recomputation.

This shouldn’t be the case. The changelog should be treated as the “source of 
truth” and the snapshot is just a disposable materialization of the log.

Introducing Automatic snapshot repair, which will automatically repair the 
checkpoint by skipping bad snapshots and rebuilding the current state from the 
last good snapshot (works even if there’s none) and applying the changelogs on 
it. This eliminates the need for manual intervention and unblocks the pipeline 
to keep it running.

Also emit metrics about number of state stores that were auto repaired in a 
given batch, so that you can build alert and dashboard for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to