Will auto CDC compile into or use MERGE under the hood? If yes, can we include a sketch in the SPIP of what that rewrite would look like? What exactly do connectors need to support to benefit from auto CDC? Just changelogs and MERGE?
- Anton

On Mon, Mar 30, 2026 at 07:44, Andreas Neumann <[email protected]> wrote:

> Hi Vaquar,
>
> I responded to most of your comments on the document itself.
> Additional comments inline.
>
> Cheers -Andreas
>
> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]> wrote:
>
>> Hi,
>> Thanks for the SPIP. I fully support the goal: abstracting CDC merge logic
>> is a huge win for the community. However, looking at the current Spark
>> versions, there are significant architectural gaps between Databricks
>> Lakeflow's proprietary implementation and OSS Spark.
>>
>> A few technical blockers need clarification before we move forward:
>>
>> - OSS Compatibility: Databricks documentation explicitly states that the
>> AUTO CDC APIs are not supported by Apache Spark Declarative Pipelines
>> <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>
> That will change with the implementation of this SPIP.
>
>> - Streaming MERGE: The proposed flow requires continuous upsert/delete
>> semantics, but Dataset.mergeInto() currently does not support streaming
>> queries. Does this SPIP introduce an entirely new execution path to bypass
>> this restriction?
>>
> This works with foreachBatch.
>
>> - Tombstone Garbage Collection: Handling stream deletes safely requires
>> state store tombstone retention (e.g., configuring
>> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving data
>> from resurrecting deleted keys. How will this be implemented natively in
>> OSS Spark state stores?
>>
> That's an interesting question. Tombstones could be modeled in the state
> store, but we are thinking that they will be modeled as an explicit output
> of the flow: either as records in the output table with a "deleted at"
> marker, possibly with a view on top to project away these rows, or as a
> separate output that contains only the tombstones. The exact design is not
> finalized; that is part of the first phase of the project.
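To make the tombstones-as-output idea above concrete, here is a minimal pure-Python sketch (not Spark code; the names `apply_change`, `live_view`, and the `deleted_at` field are illustrative assumptions, since the design is explicitly not finalized): deletes land in the output table as rows carrying a "deleted at" marker, and a view on top simply projects those rows away.

```python
from typing import Optional

# Hypothetical sketch: tombstones modeled as explicit output rows,
# per the thread, instead of living in the state store.
table: dict[str, dict] = {}  # key -> row

def apply_change(key: str, op: str, data: Optional[dict], seq: int) -> None:
    """Apply one change event; DELETE writes a tombstone row."""
    current = table.get(key)
    if current is not None and current["seq"] >= seq:
        return  # stale (out-of-order) event: ignore
    if op == "DELETE":
        table[key] = {"seq": seq, "data": None, "deleted_at": seq}
    else:  # upsert
        table[key] = {"seq": seq, "data": data, "deleted_at": None}

def live_view() -> dict[str, dict]:
    """The 'view on top' that projects away tombstone rows."""
    return {k: v["data"] for k, v in table.items() if v["deleted_at"] is None}

apply_change("u1", "UPSERT", {"city": "Oslo"}, seq=1)
apply_change("u1", "DELETE", None, seq=2)
apply_change("u1", "UPSERT", {"city": "Bergen"}, seq=1)  # late arrival: no resurrection
```

Because the tombstone row (and its sequence number) persists in the table itself, the late `seq=1` upsert cannot resurrect the deleted key, and garbage-collecting old tombstones becomes an ordinary table-maintenance concern rather than a state-store one.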
>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and NULL
>> sequencing values are explicitly not supported. How will the engine handle
>> malformed or non-monotonic upstream sequences compared to our existing
>> time-based watermarks?
>>
> I think malformed change events should, at least in the first iteration,
> fail the stream. Otherwise there is a risk of writing incorrect data.
>
>> - Given the massive surface area (new SQL DDL, streaming MERGE paths, SCD
>> Type 1/2 state logic, tombstone GC), a phased delivery plan would be very
>> helpful. It would also clarify exactly which Lakeflow components are being
>> contributed to open source versus what needs to be rebuilt from scratch.
>>
>> Best regards,
>> Viquar Khan
>>
>> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>>
>>> unsubscribe
>>>
>>> ------------------------------
>>> *From:* Andreas Neumann <[email protected]>
>>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>>> *To:* [email protected] <[email protected]>
>>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>>
>>> Hi Vaibhav,
>>>
>>> The goal of this proposal is not to replace MERGE but to provide a
>>> simple abstraction for the common use case of CDC.
>>> MERGE itself is a very powerful operator, and there will always be use
>>> cases outside of CDC that require MERGE.
>>>
>>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>>
>>> Cheers -Andreas
>>>
>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> wrote:
>>>
>>> Hi Andrew,
>>>
>>> Thanks for sharing the SPIP. Does this mean the MERGE statement would be
>>> deprecated? Also, I think there was a small typo, which I have suggested
>>> in the doc.
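A tiny pure-Python sketch of the fail-fast behavior described above (the helper name `validate_sequence` and the exception type are illustrative assumptions, not part of the SPIP): an event with a NULL (`None`) or otherwise unusable SEQUENCE BY value raises immediately, failing the stream rather than risking incorrect data.

```python
class MalformedChangeEventError(ValueError):
    """Raised when a change event carries an unusable SEQUENCE BY value."""

def validate_sequence(event: dict) -> int:
    # Hypothetical helper: NULL sequencing values are not supported, so
    # per the thread the stream should fail rather than silently reorder.
    seq = event.get("sequenceNum")
    if seq is None:
        raise MalformedChangeEventError(f"NULL sequence in event {event!r}")
    if not isinstance(seq, int):
        raise MalformedChangeEventError(f"non-integer sequence: {seq!r}")
    return seq

ok = validate_sequence({"userId": "u1", "sequenceNum": 7})
try:
    validate_sequence({"userId": "u2", "sequenceNum": None})  # fails the stream
    failed = False
except MalformedChangeEventError:
    failed = True
```

Non-monotonic (out-of-order but well-formed) sequences are a separate case: those are handled by the flow's sequencing logic, while only genuinely malformed events hit this error path.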
>>> Regards,
>>> Vaibhav
>>>
>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>
>>> +1
>>>
>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>
>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I’d like to start a discussion on a new SPIP to introduce Auto CDC
>>> support to Apache Spark.
>>>
>>> - SPIP Document:
>>> https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>
>>> Motivation
>>>
>>> With the upcoming introduction of standardized CDC support
>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>> have a unified way to produce change data feeds. However, consuming
>>> these feeds and applying them to a target table remains a significant
>>> challenge.
>>>
>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>> Type 2 (tracking full change history) often require hand-crafted,
>>> complex MERGE logic. In distributed systems, these implementations are
>>> frequently error-prone when handling deletions or out-of-order data.
>>>
>>> Proposal
>>>
>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>> encapsulates the complex logic for SCD types and out-of-order data,
>>> allowing data engineers to configure a declarative flow instead of
>>> writing manual MERGE statements. This feature will be available in
>>> both Python and SQL.
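For readers new to the two patterns named above, a minimal illustrative Python sketch (not the proposed API) of the difference: SCD Type 1 keeps only the latest row per key, while SCD Type 2 closes the previous row and appends a new one, preserving history.

```python
# Illustrative sketch only: SCD Type 1 overwrites, SCD Type 2 keeps history.
scd1: dict[str, dict] = {}  # key -> latest row (1:1 replica)
scd2: list[dict] = []       # append-only history rows

def apply_scd1(key: str, data: dict, seq: int) -> None:
    """Type 1: maintain a 1:1 replica (latest sequence wins)."""
    if key not in scd1 or scd1[key]["seq"] < seq:
        scd1[key] = {"seq": seq, **data}

def apply_scd2(key: str, data: dict, seq: int) -> None:
    """Type 2: close the open row for the key, then append a new one."""
    for row in scd2:
        if row["key"] == key and row["end_seq"] is None:
            row["end_seq"] = seq
    scd2.append({"key": key, "start_seq": seq, "end_seq": None, **data})

for key, data, seq in [("u1", {"city": "Oslo"}, 1), ("u1", {"city": "Bergen"}, 2)]:
    apply_scd1(key, data, seq)
    apply_scd2(key, data, seq)
```

Even this toy version hints at why hand-written MERGE logic gets error-prone: correct handling of deletions and out-of-order events has to be threaded through both variants, which is exactly what the proposed flow would encapsulate.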
>>> Example SQL:
>>>
>>> -- Produce a change feed
>>> CREATE STREAMING TABLE cdc.users AS
>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>
>>> -- Consume the change feed
>>> CREATE FLOW flow
>>> AS AUTO CDC INTO target
>>> FROM stream(cdc_data.users)
>>> KEYS (userId)
>>> APPLY AS DELETE WHEN operation = "DELETE"
>>> SEQUENCE BY sequenceNum
>>> COLUMNS * EXCEPT (operation, sequenceNum)
>>> STORED AS SCD TYPE 2
>>> TRACK HISTORY ON * EXCEPT (city);
>>>
>>> Please review the full SPIP for the technical details. Looking forward
>>> to your feedback and discussion!
>>>
>>> Best regards,
>>>
>>> Andreas
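As a reading aid for the TRACK HISTORY ON * EXCEPT (city) clause, here is a hedged pure-Python sketch of the intended behavior. It assumes (in line with Lakeflow-style semantics; the SPIP is the authority here) that a change touching only an excluded column updates the current row in place, while a change to any tracked column closes the open row and appends a new history row.

```python
UNTRACKED = {"city"}  # columns excluded from history tracking (EXCEPT list)

history: list[dict] = []  # append-only SCD Type 2 rows

def apply(key: str, data: dict, seq: int) -> None:
    open_row = next((r for r in history
                     if r["key"] == key and r["end_seq"] is None), None)
    if open_row is None:
        history.append({"key": key, "start_seq": seq, "end_seq": None, "data": data})
        return
    changed = {c for c in data if data[c] != open_row["data"].get(c)}
    if changed <= UNTRACKED:
        open_row["data"].update(data)  # only untracked columns changed: in place
    else:
        open_row["end_seq"] = seq      # tracked column changed: close + append
        history.append({"key": key, "start_seq": seq, "end_seq": None, "data": data})

apply("u1", {"name": "Ada", "city": "Oslo"}, 1)
apply("u1", {"name": "Ada", "city": "Bergen"}, 2)     # only city changed: no new row
apply("u1", {"name": "Ada L.", "city": "Bergen"}, 3)  # name changed: new history row
```

Under this reading, churn in the `city` column never inflates the history table, which is the practical point of excluding a column from tracking.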
