liviazhu opened a new pull request, #56332:
URL: https://github.com/apache/spark/pull/56332

   ### What changes were proposed in this pull request?
   
   This PR ensures the state data source read path never writes to the checkpoint, so reads
   work against read-only storage (e.g. a read-only cloud object store path).
   
   - `StatePartitionReader` / `StatePartitionAllColumnFamiliesReader`: open the store via
     `getReadStore` and `release()` on close instead of `getStore` + `abort()`. Column-family
     registration is an in-memory operation and is now supported on the read store via
     `ReadStateStore.createColFamilyIfAbsent`.
   - `StreamStreamJoinStatePartitionReader`: constructs `SymmetricHashJoinStateManager` with
     `readOnly = true`.
   - `SymmetricHashJoinStateManager` (V1/V2/V4): adds a `readOnly` constructor flag. The store
     handler exposes a writable `stateStore` (guarded with `require(!readOnly)`) and a mode-aware
     `readStateStore`. Inner-store reads and `createColFamilyIfAbsent` route through
     `readStateStore`; writes stay on `stateStore`. `abortIfNeeded` calls `release()` in read-only
     mode.
   - `StateStore`: `ReadStateStore.createColFamilyIfAbsent` is declared on the trait with a default
     `UnsupportedOperationException` and overridden by `WrappedReadStateStore`.
   - `HDFSBackedStateStoreProvider`: the `baseDir` `mkdirs` call is deferred from `init()` to the
     first write (`createBaseDirIfNotExists`), so read-only callers never call `mkdirs` on the
     checkpoint; `release()` is state-machine-guarded; `replayStateFromSnapshot` rejects
     `readOnly = true` and directs callers to `replayReadStateFromSnapshot`.
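
The read-only routing described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ReadStore`, `WritableStore`, `InMemoryStore`, `WrappedReadStore`, and `StoreHandler` are hypothetical, simplified stand-ins for the Spark classes named above, and only the shape of the pattern (default `UnsupportedOperationException`, `require(!readOnly)` guard, `release()` instead of `abort()`) is taken from the description.

```scala
// Hypothetical, simplified stand-ins; these are not Spark's actual classes.
trait ReadStore {
  def get(key: String): Option[String]

  // Unsupported by default, as described for the trait; wrappers override it.
  def createColFamilyIfAbsent(name: String): Unit =
    throw new UnsupportedOperationException(s"createColFamilyIfAbsent($name)")

  def release(): Unit
}

trait WritableStore extends ReadStore {
  def put(key: String, value: String): Unit
  def abort(): Unit
}

final class InMemoryStore extends WritableStore {
  private val data = scala.collection.mutable.Map.empty[String, String]
  val colFamilies = scala.collection.mutable.Set.empty[String]
  var released = false
  var aborted = false

  def get(key: String): Option[String] = data.get(key)
  def put(key: String, value: String): Unit = { data.update(key, value) }

  // Registration only touches in-memory bookkeeping.
  override def createColFamilyIfAbsent(name: String): Unit = { colFamilies += name }

  def release(): Unit = { released = true }
  def abort(): Unit = { aborted = true }
}

// Read-only view: exposes reads and in-memory column-family registration only.
final class WrappedReadStore(underlying: InMemoryStore) extends ReadStore {
  def get(key: String): Option[String] = underlying.get(key)
  override def createColFamilyIfAbsent(name: String): Unit =
    underlying.createColFamilyIfAbsent(name)
  def release(): Unit = underlying.release()
}

final class StoreHandler(store: InMemoryStore, readOnly: Boolean) {
  // Writable access is rejected outright in read-only mode.
  def stateStore: WritableStore = {
    require(!readOnly, "writable store requested in read-only mode")
    store
  }

  // Mode-aware: a read-only wrapper when readOnly, otherwise the store itself.
  val readStateStore: ReadStore =
    if (readOnly) new WrappedReadStore(store) else store

  // Read-only handlers release the store; writable handlers abort it.
  def abortIfNeeded(): Unit =
    if (readOnly) readStateStore.release() else store.abort()
}
```

In this sketch, a handler built with `readOnly = true` can still register column families and read, while any access to `stateStore` fails fast with an `IllegalArgumentException` from `require`.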
   
   A test framework (`WriteProtectedLocalFileSystem` / `WriteProtectedAbstractFileSystem` /
   `WriteProtectedPaths` / `WriteProtectedCheckpointTestMixin`) installs write-protected
   filesystems and auto-protects every `withTempDir`; `testStream` and `withWritableCheckpoint`
   temporarily suspend protection for legitimate writes. It is mixed into `StateDataSourceTestBase`
   so existing tests gain enforcement transparently.
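
As a rough illustration of the protect-then-suspend idea behind the test framework, here is a minimal sketch. It assumes a simple in-process registry of protected path prefixes; `ProtectedPaths`, `withWritable`, and `GuardedFs` are hypothetical names invented for this example, and the real `WriteProtectedPaths` and filesystem wrappers may be structured quite differently.

```scala
import java.io.IOException
import scala.collection.mutable

// Hypothetical registry of protected roots; not the PR's WriteProtectedPaths.
object ProtectedPaths {
  private val roots = mutable.Set.empty[String]
  private var suspended = 0

  def protect(root: String): Unit = synchronized { roots += root }

  def clear(): Unit = synchronized { roots.clear(); suspended = 0 }

  // A path is protected if it is a protected root or lies beneath one,
  // unless protection is currently suspended.
  def isProtected(path: String): Boolean = synchronized {
    suspended == 0 && roots.exists(r => path == r || path.startsWith(r + "/"))
  }

  // Temporarily allow writes, e.g. while a test legitimately populates a checkpoint.
  def withWritable[T](body: => T): T = {
    synchronized { suspended += 1 }
    try body finally synchronized { suspended -= 1 }
  }
}

// Stand-in for a filesystem whose mutating calls consult the registry first.
final class GuardedFs {
  val created = mutable.Buffer.empty[String]

  def mkdirs(path: String): Unit = {
    if (ProtectedPaths.isProtected(path)) {
      throw new IOException(s"write to protected path: $path")
    }
    created += path
  }
}
```

With this shape, a stray `mkdirs` on the read path surfaces as an `IOException` in the test, while setup code that wraps its writes in `withWritable` is unaffected.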
   
   ### Why are the changes needed?
   
   Reading state via the state data source could previously issue writes (mkdirs, taking a writable
   store) to the checkpoint path, which fails when the checkpoint lives on read-only storage.
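
The `mkdirs` part of this problem comes down to eager versus deferred directory creation. A minimal sketch of the deferred variant, assuming a local directory and `java.nio.file` in place of the Hadoop filesystem API; `LazyDirProvider` is a hypothetical class written for this example, with only the `createBaseDirIfNotExists` name borrowed from the description above:

```scala
import java.nio.file.{Files, Path}

// Hypothetical provider; it only illustrates deferring directory creation
// from initialization to the first write.
final class LazyDirProvider(baseDir: Path) {
  // init() deliberately touches nothing on disk.
  def init(): Unit = ()

  private def createBaseDirIfNotExists(): Unit = {
    if (!Files.exists(baseDir)) Files.createDirectories(baseDir)
  }

  // Reads never create the directory; a missing file simply yields None.
  def read(name: String): Option[String] = {
    val file = baseDir.resolve(name)
    if (Files.exists(file)) Some(new String(Files.readAllBytes(file), "UTF-8")) else None
  }

  // The first write is the only place where the directory gets created.
  def write(name: String, value: String): Unit = {
    createBaseDirIfNotExists()
    Files.write(baseDir.resolve(name), value.getBytes("UTF-8"))
  }
}
```

A caller that only ever invokes `init()` and `read` leaves the filesystem untouched, which is the property a read-only checkpoint location needs.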
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   `StateDataSourceReadSuite` (HDFS and RocksDB variants), `StateDataSourceChangeDataReadSuite`,
   `StateDataSourceTransformWithStateSuite`, `StatePartitionAllColumnFamilies{Reader,Writer}Suite`,
   `OfflineStateRepartitionIntegrationSuite`, and `StateStoreSuite`. A framework-sanity test and
   new read-only enforcement tests were added. A regression check confirmed the framework catches
   an unintended `mkdir` on the read path.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, authored with assistance from Claude (Claude Code).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

