[PR] [SPARK-56870][SDP] Extend Microbatch with CDC Metadata [spark]

via GitHub Mon, 18 May 2026 19:33:03 -0700


AnishMahto opened a new pull request, #55970:
URL: https://github.com/apache/spark/pull/55970

Approved AutoCDC SPIP:
https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7

--------

This is a stacked PR. Review incremental diff here:
https://github.com/AnishMahto/spark/compare/SPARK-56856-SCD1-microbatch-deduplication...SPARK-56870-extend-microbatch-with-cdc-metadata

--------

Preamble:

The SCD type 1 flow is a foreachBatch streaming query on an input
change-data-feed, and is responsible for reconciling the incoming change data
onto some target table that follows SCD1 replication semantics.

SCD1 flows also maintain an "auxiliary" table to keep track of
early-arriving out-of-order received events state. Each microbatch will need to
reconcile against this auxiliary table as well, and update the auxiliary
table's state appropriately for future microbatches.

**Extend Microbatch with CDC Metadata:**

After deduplication, all of the incoming rows can be classified as either a
delete event or an upsert event (mutually exclusive), and there's at most one
per key.

If we identify a row as a delete event, remember its sequencing as its
`deleteVersion`. If we identify a row as an upsert event, remember its
sequencing as its `upsertVersion`. That is, `deleteVersion`/`upsertVersion`
encode both the sequencing for the row as well as the row classification
(delete or upsert).

We need to persist this encoded information now, because in future stages we
may drop the columns that `deleteCondition` needed to do the classification in
the first place, depending on which columns were selected by
`ChangeArgs.columnSelection`.

**Where is the CDC Metadata stored?**

Within the microbatch, we append a `_cdc_metadata` struct column, that
stores the `deleteSequence` and `upsertSequence`.

This `_cdc_metadata` column will eventually also land in the persisted
target and auxiliary tables, which are the artifacts of an AutoCDC flow. This
column represents operational metadata that the AutoCDC flow has tagged a row
with, and is necessary for out-of-order correctness of the SCD decomposition.

Users will not be able to opt out of persisting this column in the target
table using `ChangeArgs.columnSelection`, as it is necessary for correctness.
The column will not have a stable public contract, and users should make no
assumptions on its contents.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-56870][SDP] Extend Microbatch with CDC Metadata [spark]

Reply via email to