Re: SPIP: Auto CDC support for Apache Spark

Shixiong Zhu Thu, 26 Mar 2026 20:49:09 -0700

As the shepherd for this proposal, I would like to add an observation from
my experience: I have seen many workloads break because hand-written MERGE
logic for out-of-order data in streaming is extremely difficult to code
correctly.


This SPIP will eliminate a significant amount of complex, error-prone MERGE
code. Big +1 from me.

Additionally, please note a correction for the JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-56249.

Best regards,

Shixiong Zhu

On Thu, Mar 26, 2026 at 6:08 PM Andreas Neumann <[email protected]> wrote:

> Hi all,
>
> I’d like to start a discussion on a new SPIP to introduce Auto CDC
> support to Apache Spark.
>
>    -
>
>    SPIP Document:
>    
> https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>    -
>
>    JIRA: <https://issues.apache.org/jira/browse/SPARK-55668>
>    https://issues.apache.org/jira/browse/SPARK-5566
>
> Motivation
>
> With the upcoming introduction of standardized CDC support
> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have
> a unified way to produce change data feeds. However, consuming these
> feeds and applying them to a target table remains a significant challenge.
>
> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2
> (tracking full change history) often require hand-crafted, complex MERGE
> logic. In distributed systems, these implementations are frequently
> error-prone when handling deletions or out-of-order data.
> Proposal
>
> This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates
> the complex logic for SCD types and out-of-order data, allowing data
> engineers to configure a declarative flow instead of writing manual MERGE
> statements. This feature will be available in both Python and SQL.
>
> Example SQL:
>
> -- Produce a change feed
>
> CREATE STREAMING TABLE cdc.users AS
>
> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>
>
> -- Consume the change feed
>
> CREATE FLOW flow
>
> AS AUTO CDC INTO
>
>   target
>
> FROM stream(cdc_data.users)
>
>   KEYS (userId)
>
>   APPLY AS DELETE WHEN operation = "DELETE"
>
>   SEQUENCE BY sequenceNum
>
>   COLUMNS * EXCEPT (operation, sequenceNum)
>
>   STORED AS SCD TYPE 2
>
>   TRACK HISTORY ON * EXCEPT (city);
>
>
> Please review the full SPIP for the technical details. Looking forward to
> your feedback and discussion!
>
> Best regards,
>
> Andreas
>
>

Re: SPIP: Auto CDC support for Apache Spark

Reply via email to