HyukjinKwon opened a new pull request, #56718:
URL: https://github.com/apache/spark/pull/56718

   ## What changes were proposed in this pull request?
   
   **[DO-NOT-MERGE]** — CI-stabilization / observability change (draft).
   
   When loading state from a snapshot (e.g. reading with the 
`snapshotStartBatchId` option), the snapshot zip for the requested version can 
be missing, most often because the asynchronous maintenance thread has not 
uploaded it yet. Today that surfaces only as:
   
   ```
   [CANNOT_LOAD_STATE_STORE.UNCATEGORIZED] ...
   Caused by: java.io.FileNotFoundException: .../state/0/1/2.zip does not exist
   ```
   
   with no indication of whether the snapshot was never created or merely not 
uploaded in time. This PR enriches the `FileNotFoundException` thrown from 
`RocksDBFileManager` with the snapshot (`.zip`) and changelog (`.changelog`) 
files that **are** present in the DFS checkpoint root, so the situation is 
self-diagnosing from logs (e.g. *asked for 2.zip, only 1.zip present* clearly 
indicates the async-upload race). The listing is best-effort and never throws.
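
A minimal sketch of the enrichment idea described above, not the code in this PR. It assumes a local directory accessed through `java.nio.file`, whereas `RocksDBFileManager` works against a DFS. The class name `SnapshotDiagnostics`, the helpers `listBestEffort` and `enrich`, and the message wording are all hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; names and message format are assumptions.
public class SnapshotDiagnostics {

    // Best-effort listing: returns the names of files ending in `suffix`,
    // or an empty list if the directory cannot be read. Never throws.
    static List<String> listBestEffort(Path root, String suffix) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                if (name.endsWith(suffix)) {
                    names.add(name);
                }
            }
        } catch (IOException | RuntimeException e) {
            // Diagnostics must not mask the original failure.
            return Collections.emptyList();
        }
        Collections.sort(names); // lexicographic order, enough for a log line
        return names;
    }

    // Wraps the original exception in one whose message also reports
    // which snapshot and changelog files are actually present.
    static FileNotFoundException enrich(FileNotFoundException cause, Path root) {
        String msg = cause.getMessage()
            + " (snapshots present: " + listBestEffort(root, ".zip")
            + ", changelogs present: " + listBestEffort(root, ".changelog") + ")";
        FileNotFoundException enriched = new FileNotFoundException(msg);
        enriched.initCause(cause);
        return enriched;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a checkpoint root where 1.zip was uploaded but 2.zip was not.
        Path root = Files.createTempDirectory("state");
        Files.createFile(root.resolve("1.zip"));
        Files.createFile(root.resolve("2.changelog"));

        FileNotFoundException original =
            new FileNotFoundException(root.resolve("2.zip") + " does not exist");
        System.out.println(enrich(original, root).getMessage());
    }
}
```

Under these assumptions, the printed message ends with `(snapshots present: [1.zip], changelogs present: [2.changelog])`, so a reader of the log can see that version 2 was requested while only snapshot 1 had been uploaded. Catching every failure inside the listing helper keeps the diagnostic step from replacing the original exception with a secondary one.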
   
   A unit test in `RocksDBSuite` deterministically validates the enriched 
message.
   
   ## Why are the changes needed?
   
   The `snapshotStartBatchId ... transformWithState` tests have failed 
intermittently in scheduled Maven jobs (master, branch-4.x, branch-4.2; both 
plain and row-checksum variants) with exactly this opaque 
`FileNotFoundException`. The deterministic test-side fix was handled 
separately; this change makes any *future* recurrence, in any suite or 
scheduled job, immediately diagnosable from the error message instead of 
requiring a dig through CI artifacts.
   
   ## Does this PR introduce any user-facing change?
   
   No (improves an error message on an already-failing path).
   
   ## How was this patch tested?
   
   New unit test `RocksDBFileManager: missing snapshot during load reports the 
available versions`, plus a focused fork workflow that runs `RocksDBSuite` once 
and repeats the snapshot suites 5× to confirm stability. The last commit 
(validation workflow) must be reverted before merge.
   
   This pull request and its description were written by Isaac.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

