Hi Spark devs,
I'd like to call a vote on the SPIP*: Change Data Capture (CDC) Support*
*Summary:*
This SPIP proposes a unified approach by adding a CHANGES SQL clause and
corresponding DataFrame/DataStream APIs that work across DSv2 connectors.
1. Standardized User API
SQL:
-- Batch: What changed between version 10 and 20?
SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
-- Streaming: Continuously process changes
CREATE STREAMING TABLE cdc_sink AS
SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
DataFrame API:
spark.read
.option("startingVersion", "10")
.option("endingVersion", "20")
.changes("my_table")
2. Engine-Level Post Processing Under the hood, this proposal introduces a
minimal Changelog interface for DSv2 connectors. Spark's Catalyst optimizer
will take over the CDC post-processing, including:
-
Filtering out copy-on-write carry-over rows.
-
Deriving pre-image/post-image updates from raw insert/delete pairs.
-
Computing net changes.
*Relevant Links:*
- *SPIP Doc: *
https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
- *Discuss Thread: *
https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
- *JIRA: *https://issues.apache.org/jira/browse/SPARK-55668
*The vote will be open for at least 72 hours. *Please vote:
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...
Thanks,
Gengliang Wang