Hi all,

I’d like to start a discussion on a new SPIP to introduce Auto CDC support
to Apache Spark.

   -

   SPIP Document:
   
https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
   -

   JIRA: <https://issues.apache.org/jira/browse/SPARK-55668>
   https://issues.apache.org/jira/browse/SPARK-5566

Motivation

With the upcoming introduction of standardized CDC support
<https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have a
unified way to produce change data feeds. However, consuming these feeds
and applying them to a target table remains a significant challenge.

Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2
(tracking full change history) often require hand-crafted, complex MERGE
logic. In distributed systems, these implementations are frequently
error-prone when handling deletions or out-of-order data.
Proposal

This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates
the complex logic for SCD types and out-of-order data, allowing data
engineers to configure a declarative flow instead of writing manual MERGE
statements. This feature will be available in both Python and SQL.

Example SQL:

-- Produce a change feed

CREATE STREAMING TABLE cdc.users AS

SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;


-- Consume the change feed

CREATE FLOW flow

AS AUTO CDC INTO

  target

FROM stream(cdc_data.users)

  KEYS (userId)

  APPLY AS DELETE WHEN operation = "DELETE"

  SEQUENCE BY sequenceNum

  COLUMNS * EXCEPT (operation, sequenceNum)

  STORED AS SCD TYPE 2

  TRACK HISTORY ON * EXCEPT (city);


Please review the full SPIP for the technical details. Looking forward to
your feedback and discussion!

Best regards,

Andreas

Reply via email to