I think there is some confusion - IIUC, recent proposals about CDC mostly
assume simple transformations ('stateless' in streaming-query terms).
CDC events will still be treated as append-only by the operators,
rather than the operators truly understanding how to retract on
deletion/update. Making CDC events truly reflected in the semantics of
the query would require a lot of effort, and probably a much bigger scope
than the current proposals about CDC.

On Sun, Mar 29, 2026 at 8:42 AM vaquar khan <[email protected]> wrote:

> Hi,
> Thanks for the SPIP. I fully support the goal: abstracting CDC merge logic
> is a huge win for the community. However, looking at the current Spark
> versions, there are significant architectural gaps between Databricks
> Lakeflow's proprietary implementation and OSS Spark.
>
> A few technical blockers need clarification before we move forward:
>
> - OSS Compatibility: Databricks documentation explicitly states that the
> AUTO CDC APIs are not supported by Apache Spark Declarative Pipelines
> <https://docs.databricks.com/gcp/en/ldp/cdc>.
>
> - Streaming MERGE: The proposed flow requires continuous upsert/delete
> semantics, but Dataset.mergeInto() currently does not support streaming
> queries. Does this SPIP introduce an entirely new execution path to bypass
> this restriction?
>
> - Tombstone Garbage Collection: Handling stream deletes safely requires
> state store tombstone retention (e.g., configuring
> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving data
> from resurrecting deleted keys. How will this be implemented natively in
> OSS Spark state stores?
>
> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and NULL
> sequencing values are explicitly not supported. How will the engine handle
> malformed or non-monotonic upstream sequences compared to our existing
> time-based watermarks?
>
> - Given the massive surface area (new SQL DDL, streaming MERGE paths, SCD
> Type 1/2 state logic, tombstone GC), a phased delivery plan would be very
> helpful. It would also clarify exactly which Lakeflow components are being
> contributed to open source versus what needs to be rebuilt from scratch.
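[Editor's note: to illustrate the tombstone-retention concern raised above, here is a plain-Python sketch (not Spark code; the event shape and field names are invented for illustration) of why a delete's sequence number must be remembered: without the tombstone, a late-arriving event with an older sequence number silently resurrects the deleted key.]

```python
# Hypothetical sketch (plain Python, not Spark) of delete tombstones.
# State maps key -> (last_seq, row_or_None); a tombstone is (seq, None).

def apply_event(state, event, keep_tombstones=True):
    """Apply one CDC event. With keep_tombstones=False, a DELETE drops
    the key entirely, losing the sequence high-water mark for that key."""
    key, seq, op = event["key"], event["seq"], event["op"]
    prev = state.get(key)
    if prev is not None and seq <= prev[0]:
        return state  # out of order: older than what was already applied
    if op == "DELETE":
        if keep_tombstones:
            state[key] = (seq, None)  # remember the delete and its sequence
        else:
            state.pop(key, None)      # forgetting the delete is the bug
    else:
        state[key] = (seq, event["row"])
    return state

events = [
    {"key": "u1", "seq": 1, "op": "UPSERT", "row": {"city": "NYC"}},
    {"key": "u1", "seq": 3, "op": "DELETE"},
    {"key": "u1", "seq": 2, "op": "UPSERT", "row": {"city": "SF"}},  # late
]

with_tombstones, without_tombstones = {}, {}
for e in events:
    apply_event(with_tombstones, e, keep_tombstones=True)
    apply_event(without_tombstones, e, keep_tombstones=False)

print(with_tombstones)     # {'u1': (3, None)} - u1 stays deleted
print(without_tombstones)  # {'u1': (2, {'city': 'SF'})} - wrongly resurrected
```

Retaining tombstones forever is unbounded state, which is why a GC threshold (as in the config mentioned above) is needed: a tombstone can only be dropped once events older than it can no longer arrive.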
>
> Best regards,
> Viquar Khan
>
> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>
>> unsubscribe
>>
>> Get Outlook for Android <https://aka.ms/AAb9ysg>
>> ------------------------------
>> *From:* Andreas Neumann <[email protected]>
>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>
>> Hi Vaibhav,
>>
>> The goal of this proposal is not to replace MERGE but to provide a simple
>> abstraction for the common use case of CDC.
>> MERGE itself is a very powerful operator, and there will always be use
>> cases outside of CDC that require MERGE.
>>
>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>
>> Cheers -Andreas
>>
>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> wrote:
>>
>> Hi Andreas,
>>
>> Thanks for sharing the SPIP. Does that mean the MERGE statement would be
>> deprecated? Also, I think there was a small typo; I have suggested a fix
>> in the doc.
>>
>> Regards,
>> Vaibhav
>>
>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>
>> +1
>>
>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>
>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'd like to start a discussion on a new SPIP to introduce Auto CDC
>> support to Apache Spark.
>>
>> - SPIP Document:
>>   https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>
>> Motivation
>>
>> With the upcoming introduction of standardized CDC support
>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>> have a unified way to produce change data feeds. However, consuming
>> these feeds and applying them to a target table remains a significant
>> challenge.
>>
>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type
>> 2 (tracking full change history) often require hand-crafted, complex
>> MERGE logic. In distributed systems, these implementations are
>> frequently error-prone when handling deletions or out-of-order data.
>>
>> Proposal
>>
>> This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates
>> the complex logic for SCD types and out-of-order data, allowing data
>> engineers to configure a declarative flow instead of writing manual MERGE
>> statements. This feature will be available in both Python and SQL.
>>
>> Example SQL:
>>
>> -- Produce a change feed
>> CREATE STREAMING TABLE cdc.users AS
>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>
>> -- Consume the change feed
>> CREATE FLOW flow
>> AS AUTO CDC INTO
>>   target
>> FROM stream(cdc.users)
>> KEYS (userId)
>> APPLY AS DELETE WHEN operation = "DELETE"
>> SEQUENCE BY sequenceNum
>> COLUMNS * EXCEPT (operation, sequenceNum)
>> STORED AS SCD TYPE 2
>> TRACK HISTORY ON * EXCEPT (city);
>>
>> Please review the full SPIP for the technical details. Looking forward to
>> your feedback and discussion!
>>
>> Best regards,
>>
>> Andreas
>>
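[Editor's note: for readers unfamiliar with the semantics the SQL example declares, here is a rough plain-Python model (my own illustration, not the proposed implementation; the feed schema is invented) of what an SCD Type 1 flow with KEYS, SEQUENCE BY, and APPLY AS DELETE WHEN amounts to: the latest row per key wins, ordered by the sequencing column, and matching delete events remove the key.]

```python
# Illustrative plain-Python model of SCD Type 1 semantics for clauses like
# KEYS (userId), SEQUENCE BY sequenceNum, APPLY AS DELETE WHEN operation = "DELETE".

def auto_cdc_scd1(events, key="userId", seq="sequenceNum", op="operation"):
    """Fold a CDC feed into a Type 1 target: one row per key, the event
    with the highest sequence value wins, and deletes remove the key."""
    target = {}  # key -> (last_seq, row_or_None)
    for e in sorted(events, key=lambda e: e[seq]):   # SEQUENCE BY
        k, s = e[key], e[seq]
        if k in target and s <= target[k][0]:
            continue                                 # stale, already superseded
        if e[op] == "DELETE":                        # APPLY AS DELETE WHEN
            target[k] = (s, None)                    # tombstone
        else:
            # COLUMNS * EXCEPT (operation, sequenceNum)
            row = {c: v for c, v in e.items() if c not in (op, seq)}
            target[k] = (s, row)
    return {k: row for k, (_, row) in target.items() if row is not None}

feed = [
    {"userId": 1, "sequenceNum": 1, "operation": "INSERT", "city": "NYC"},
    {"userId": 1, "sequenceNum": 2, "operation": "UPDATE", "city": "SF"},
    {"userId": 2, "sequenceNum": 1, "operation": "INSERT", "city": "LA"},
    {"userId": 2, "sequenceNum": 2, "operation": "DELETE", "city": None},
]
print(auto_cdc_scd1(feed))  # {1: {'userId': 1, 'city': 'SF'}}
```

SCD Type 2 (STORED AS SCD TYPE 2 in the example) differs in that, instead of overwriting, each change closes the current history row and opens a new one, with TRACK HISTORY controlling which column changes open a new row.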
