ericm-db opened a new pull request, #56018:
URL: https://github.com/apache/spark/pull/56018

   ### What changes were proposed in this pull request?
   
   Refactor `CommitLog` so that the commit log metadata is dispatched through a 
`CommitMetadataBase` trait with concrete `CommitMetadata` (V1, watermark only) 
and `CommitMetadataV2` (watermark + `stateUniqueIds`) case classes. The 
deserializer now reads the wire-format version from the file header and 
constructs the matching subclass.
   
   This is preparation for `CommitMetadataV3` (which adds sink metadata for 
streaming sink evolution) in a follow-up PR.
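
The type layout described above might look roughly like the following minimal sketch. The wrapper object `CommitMetadataSketch`, the `describe` helper, and the value type of `stateUniqueIds` are assumptions made for illustration; the actual field types and members in Spark may differ.

```scala
// Hypothetical, simplified shapes for illustration only.
object CommitMetadataSketch {
  sealed trait CommitMetadataBase {
    def version: Int
    def nextBatchWatermarkMs: Long
  }

  // V1: watermark only.
  final case class CommitMetadata(nextBatchWatermarkMs: Long = 0L)
      extends CommitMetadataBase {
    override def version: Int = 1
  }

  // V2: watermark plus state unique IDs (the value type is assumed here).
  final case class CommitMetadataV2(
      nextBatchWatermarkMs: Long = 0L,
      stateUniqueIds: Map[Long, Seq[String]] = Map.empty)
      extends CommitMetadataBase {
    override def version: Int = 2
  }

  // Callers branch on the concrete subclass instead of probing an optional field.
  def describe(m: CommitMetadataBase): String = m match {
    case CommitMetadata(wm) => s"v1 watermark=$wm"
    case CommitMetadataV2(wm, ids) => s"v2 watermark=$wm ids=${ids.size}"
  }
}
```

With a sealed trait, a later `CommitMetadataV3` would be one more subclass, and the compiler can flag any `match` that does not yet handle it.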
   
   Notable changes:
   - Add `CommitMetadataBase` trait and `CommitMetadataV2` case class.
   - `CommitMetadata` becomes V1 (no `stateUniqueIds` field).
   - Add `CommitLog.createMetadata` factory that dispatches by version and 
defaults to the configured `STATE_STORE_CHECKPOINT_FORMAT_VERSION`.
   - `CommitLog.readCommitMetadata` reads the version line and constructs the 
matching subclass.
   - `MicroBatchExecution`, `OfflineStateRepartitionRunner`, and existing tests 
updated to use the new types/factory.
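
As a rough illustration of the factory and the version-line dispatch listed above, the sketch below uses simplified stand-in types. The signatures, the `configuredVersion` value standing in for `STATE_STORE_CHECKPOINT_FORMAT_VERSION`, the `parseVersionLine` helper, and the assumed `v<N>` form of the version line are all assumptions, not the PR's actual code.

```scala
object CommitLogFactorySketch {
  sealed trait CommitMetadataBase { def version: Int }

  final case class CommitMetadata(nextBatchWatermarkMs: Long)
      extends CommitMetadataBase {
    override def version: Int = 1
  }

  final case class CommitMetadataV2(
      nextBatchWatermarkMs: Long,
      stateUniqueIds: Map[Long, Seq[String]])
      extends CommitMetadataBase {
    override def version: Int = 2
  }

  // Stand-in for the configured checkpoint format version (hypothetical value).
  val configuredVersion: Int = 2

  // Dispatch by version; fall back to the configured version when none is given.
  def createMetadata(
      nextBatchWatermarkMs: Long,
      stateUniqueIds: Map[Long, Seq[String]] = Map.empty,
      version: Int = configuredVersion): CommitMetadataBase = version match {
    case 1 => CommitMetadata(nextBatchWatermarkMs)
    case 2 => CommitMetadataV2(nextBatchWatermarkMs, stateUniqueIds)
    case other =>
      throw new IllegalArgumentException(s"Unsupported commit log version: $other")
  }

  // Assumes the version line has the form "v<N>", for example "v1".
  def parseVersionLine(line: String): Int = {
    val trimmed = line.trim
    require(trimmed.matches("v\\d+"), s"Malformed version line: '$trimmed'")
    trimmed.drop(1).toInt
  }
}
```

A reader built this way would call something like `createMetadata(watermark, version = parseVersionLine(header))`, so the subclass follows the file's own version rather than the session conf.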
   
   This PR is the first follow-up in the SPARK-56719 sink-evolution series. The 
next two follow-ups are stacked on top of this branch (SPARK-56971: add 
`CommitMetadataV3` + `SinkMetadataInfo`; SPARK-56972: wire sink name 
persistence through `MicroBatchExecution`).
   
   ### Why are the changes needed?
   
   The pre-refactor `CommitMetadata` carried both the V1 and V2 wire shapes in a 
single case class, with `stateUniqueIds` optional. That made it awkward to add 
a V3 wire format with additional fields, and it forced `serialize` to take the 
wire version from `SQLConf` rather than from the metadata itself.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No new public API. The wire format for V1 changes slightly: V1 commit log 
files no longer serialize `stateUniqueIds: null`. Old V1 files continue to be 
read because the V1 deserializer ignores the (now-unknown) field.
   
   This PR also relaxes the version-exact-match check on read so that a commit 
log opened with the V2 conf can deserialize a V1 file. This incidentally 
resolves SPARK-50653.
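
The difference between an exact-match check and a relaxed one can be sketched as follows. Both function names are hypothetical, and the relaxed rule shown (accept any version from 1 up to the configured one) is only one possible reading; the description above states only that a V2-configured log can read a V1 file.

```scala
object VersionCheckSketch {
  // Strict rule: the file version must equal the configured version.
  def validateExact(fileVersion: Int, configuredVersion: Int): Int = {
    require(
      fileVersion == configuredVersion,
      s"Expected version $configuredVersion but found $fileVersion")
    fileVersion
  }

  // Relaxed rule (assumed): accept any version from 1 up to the configured one.
  def validateRelaxed(fileVersion: Int, configuredVersion: Int): Int = {
    require(
      fileVersion >= 1 && fileVersion <= configuredVersion,
      s"Unsupported version $fileVersion (configured: $configuredVersion)")
    fileVersion
  }
}
```

Under the strict rule, `validateExact(1, 2)` throws, which matches the old behavior of rejecting a V1 file under the V2 conf; `validateRelaxed(1, 2)` returns `1`, and the reader can then build the V1 metadata class.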
   
   ### How was this patch tested?
   
   - Existing `CommitLogSuite` (V1, V2, and cross-version) passes; the 
cross-version test now asserts successful V1 deserialization.
   - `StreamingSinkEvolutionSuite` (from SPARK-56719) still passes.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-opus-4-7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

