Hyukjin Kwon created SPARK-57658:
------------------------------------

             Summary: Report the available snapshot versions when a RocksDB 
state snapshot load fails
                 Key: SPARK-57658
                 URL: https://issues.apache.org/jira/browse/SPARK-57658
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 5.0.0
            Reporter: Hyukjin Kwon


When loading state from a snapshot (e.g. reading with the 
{{snapshotStartBatchId}} option), the snapshot zip for the requested version can 
be missing, most commonly because the asynchronous maintenance thread has not 
uploaded it yet. Today this surfaces only as:
{code}
[CANNOT_LOAD_STATE_STORE.UNCATEGORIZED] ...
Caused by: java.io.FileNotFoundException: .../state/0/1/2.zip does not exist
{code}
which gives no indication of whether the snapshot was never created or merely 
not uploaded in time. This has caused hard-to-diagnose intermittent failures in 
scheduled CI (the {{snapshotStartBatchId ... transformWithState}} tests, across 
master/branch-4.x/branch-4.2 and both plain and row-checksum variants).

This change enriches the {{FileNotFoundException}} thrown from 
{{RocksDBFileManager}} with the snapshot (.zip) and changelog (.changelog) files 
that are actually present in the DFS checkpoint root, so any future occurrence 
is self-diagnosing from the logs (e.g. 'asked for 2.zip, only 1.zip present' 
clearly indicates the async upload race). The listing is best-effort and never 
throws. A unit test in {{RocksDBSuite}} validates the enriched message.
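
As a rough illustration of the idea described above (not the actual patch), the 
sketch below builds an enriched {{FileNotFoundException}} from a best-effort 
directory listing. Everything in it is an assumption made for illustration: the 
class name SnapshotLoadDiagnostics, both method names, the message wording, and 
the use of java.nio.file on a local directory as a stand-in for the Hadoop 
FileSystem calls that would be needed against a real DFS checkpoint root.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; this is not Spark's RocksDBFileManager code.
final class SnapshotLoadDiagnostics {

    // Best-effort listing of snapshot (.zip) and changelog (.changelog) file names,
    // sorted lexicographically. Any failure is folded into the returned text, so
    // this method never throws and cannot mask the original load error.
    static String describeAvailableFiles(Path checkpointDir) {
        try (Stream<Path> entries = Files.list(checkpointDir)) {
            List<String> names = entries
                .map(p -> p.getFileName().toString())
                .filter(n -> n.endsWith(".zip") || n.endsWith(".changelog"))
                .sorted()
                .collect(Collectors.toList());
            return names.isEmpty() ? "none" : String.join(", ", names);
        } catch (Exception e) {
            return "unavailable (listing failed: " + e + ")";
        }
    }

    // Builds the exception for a missing <version>.zip, appending what is present.
    static FileNotFoundException missingSnapshot(Path checkpointDir, long version) {
        Path expected = checkpointDir.resolve(version + ".zip");
        return new FileNotFoundException(
            expected + " does not exist; snapshot/changelog files present in "
                + checkpointDir + ": " + describeAvailableFiles(checkpointDir));
    }
}
```

With a directory that holds only 1.zip and 2.changelog, calling 
missingSnapshot(dir, 2) yields a message that names the missing 2.zip and lists 
the two files that do exist, which is the kind of evidence that points to the 
async upload race. Catching every exception inside the listing is deliberate: 
the diagnostic should degrade to "unavailable" rather than replace the original 
failure with a new one.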



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
