Hi Anton,

- Yes, it is our intention to compile it down to a MERGE.
- The detailed design will follow once the work is started; the purpose of
this SPIP is to form alignment on this being a favorable improvement to
Spark.
- To support Auto CDC, changelogs are not required (producing a change log
is the topic of an earlier SPIP). Keep in mind that the change feed
originates from a different data source than the target of the Auto CDC
flow.
- All we will require of the target is that it support MERGE.
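[Editorial note for readers of the archive: a minimal, hypothetical sketch of the semantics such a compiled MERGE would have for an SCD Type 1 target (latest event per key wins, DELETE removes the row). The function name `apply_cdc_batch` is illustrative; the column names `userId`, `sequenceNum`, and `operation` follow the SQL example later in the thread. The actual rewrite is part of the detailed design mentioned above.]

```python
# Hypothetical model of what the generated MERGE would do for an
# SCD Type 1 target. `target` maps key -> row dict; `batch` is a list
# of change events carrying userId, sequenceNum, and operation.

def apply_cdc_batch(target, batch):
    # Keep only the latest event per key (highest sequenceNum),
    # mirroring the deduplication a MERGE source subquery would do.
    latest = {}
    for event in batch:
        key = event["userId"]
        if key not in latest or event["sequenceNum"] > latest[key]["sequenceNum"]:
            latest[key] = event

    for key, event in latest.items():
        current = target.get(key)
        # Ignore out-of-order events older than what is already applied.
        if current is not None and event["sequenceNum"] <= current["sequenceNum"]:
            continue
        if event["operation"] == "DELETE":
            target.pop(key, None)   # WHEN MATCHED ... THEN DELETE
        else:
            row = {k: v for k, v in event.items() if k != "operation"}
            target[key] = row       # WHEN MATCHED UPDATE / NOT MATCHED INSERT
    return target
```

Note that once a DELETE drops the row, its sequenceNum is gone, so a later batch carrying an older event could resurrect the key; that is exactly the tombstone problem raised further down in the thread.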
Cheers -Andreas

On Mon, Mar 30, 2026 at 8:49 AM Anton Okolnychyi <[email protected]> wrote:

> Will auto CDC compile into or use MERGE under the hood? If yes, can we
> include a sketch of what that rewrite will look like in the SPIP? What
> exactly do connectors need to support to benefit from auto CDC? Just
> support changelogs and MERGE?
>
> - Anton
>
> On Mon, Mar 30, 2026 at 07:44 Andreas Neumann <[email protected]> wrote:
>
>> Hi Vaquar,
>>
>> I responded to most of your comments on the document itself.
>> Additional comments inline.
>>
>> Cheers -Andreas
>>
>> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]> wrote:
>>
>>> Hi,
>>> Thanks for the SPIP. I fully support the goal: abstracting CDC merge
>>> logic is a huge win for the community. However, looking at the current
>>> Spark versions, there are significant architectural gaps between
>>> Databricks Lakeflow's proprietary implementation and OSS Spark.
>>>
>>> A few technical blockers need clarification before we move forward:
>>>
>>> - OSS Compatibility: Databricks documentation explicitly states that
>>> the AUTO CDC APIs are not supported by Apache Spark Declarative
>>> Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>>
>> That will change with the implementation of this SPIP.
>>
>>> - Streaming MERGE: The proposed flow requires continuous upsert/delete
>>> semantics, but Dataset.mergeInto() currently does not support streaming
>>> queries. Does this SPIP introduce an entirely new execution path to
>>> bypass this restriction?
>>>
>> This works with foreachBatch.
>>
>>> - Tombstone Garbage Collection: Handling stream deletes safely requires
>>> state store tombstone retention (e.g., configuring
>>> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving
>>> data from resurrecting deleted keys. How will this be implemented
>>> natively in OSS Spark state stores?
>>>
>> That's an interesting question.
>> Tombstones could be modeled in the state store, but we are thinking
>> that they will be modeled as an explicit output of the flow: either as
>> records in the output table with a "deleted at" marker, possibly with a
>> view on top to project away these rows, or in a separate output that
>> contains only the tombstones. The exact design is not finalized; that is
>> part of the first phase of the project.
>>
>>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and
>>> NULL sequencing values are explicitly not supported. How will the
>>> engine handle malformed or non-monotonic upstream sequences compared to
>>> our existing time-based watermarks?
>>>
>> I think malformed change events should, at least in the first iteration,
>> fail the stream. Otherwise there is a risk of writing incorrect data.
>>
>>> - Given the massive surface area (new SQL DDL, streaming MERGE paths,
>>> SCD Type 1/2 state logic, tombstone GC), a phased delivery plan would
>>> be very helpful. It would also clarify exactly which Lakeflow
>>> components are being contributed to open source versus what needs to be
>>> rebuilt from scratch.
>>>
>>> Best regards,
>>> Vaquar Khan
>>>
>>> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>>>
>>>> unsubscribe
>>>>
>>>> Get Outlook for Android <https://aka.ms/AAb9ysg>
>>>> ------------------------------
>>>> *From:* Andreas Neumann <[email protected]>
>>>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>>>> *To:* [email protected] <[email protected]>
>>>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>>>
>>>> Hi Vaibhav,
>>>>
>>>> The goal of this proposal is not to replace MERGE but to provide a
>>>> simple abstraction for the common use case of CDC.
>>>> MERGE itself is a very powerful operator, and there will always be use
>>>> cases outside of CDC that require MERGE.
>>>>
>>>> And thanks for spotting the typo in the SPIP. It is fixed now!
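[Editorial note: a minimal sketch of the "explicit output" tombstone option described earlier in the thread, assuming a hypothetical `deleted_at` marker column and a view that projects tombstoned rows away. It also shows malformed events (NULL sequencing value) failing the batch, per the answer on SEQUENCE BY; all names are illustrative, not the SPIP's.]

```python
# Hypothetical model: deletes write a `deleted_at` tombstone instead of
# removing the row, so the key and its sequenceNum survive and a
# late-arriving upsert cannot resurrect the deleted record.

def apply_event(table, event, now):
    if event.get("sequenceNum") is None:
        # Malformed change event: fail rather than risk incorrect data.
        raise ValueError("NULL sequencing value: failing the stream")
    key = event["userId"]
    current = table.get(key)
    if current is not None and event["sequenceNum"] <= current["sequenceNum"]:
        return  # out-of-order event, already superseded
    if event["operation"] == "DELETE":
        # Tombstone row with a "deleted at" marker.
        table[key] = {"userId": key,
                      "sequenceNum": event["sequenceNum"],
                      "deleted_at": now}
    else:
        row = {k: v for k, v in event.items() if k != "operation"}
        row["deleted_at"] = None
        table[key] = row

def active_view(table):
    # The "view on top" that projects away tombstoned rows.
    return {k: r for k, r in table.items() if r["deleted_at"] is None}
```

Garbage-collecting tombstones older than some retention threshold would then be a separate maintenance step on this output, rather than state-store logic.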
>>>>
>>>> Cheers -Andreas
>>>>
>>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]>
>>>> wrote:
>>>>
>>>> Hi Andreas,
>>>>
>>>> Thanks for sharing the SPIP. Does that mean the MERGE statement would
>>>> be deprecated? Also, I think there was a small typo, which I have
>>>> suggested in the doc.
>>>>
>>>> Regards,
>>>> Vaibhav
>>>>
>>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>>
>>>> +1
>>>>
>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>
>>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I’d like to start a discussion on a new SPIP to introduce Auto CDC
>>>> support to Apache Spark.
>>>>
>>>> - SPIP Document:
>>>> https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>
>>>> Motivation
>>>>
>>>> With the upcoming introduction of standardized CDC support
>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>>> have a unified way to produce change data feeds. However, consuming
>>>> these feeds and applying them to a target table remains a significant
>>>> challenge.
>>>>
>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>> complex MERGE logic. In distributed systems, these implementations are
>>>> frequently error-prone when handling deletions or out-of-order data.
>>>>
>>>> Proposal
>>>>
>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>>> encapsulates the complex logic for SCD types and out-of-order data,
>>>> allowing data engineers to configure a declarative flow instead of
>>>> writing manual MERGE statements. This feature will be available in
>>>> both Python and SQL.
>>>> Example SQL:
>>>>
>>>> -- Produce a change feed
>>>> CREATE STREAMING TABLE cdc.users AS
>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>
>>>> -- Consume the change feed
>>>> CREATE FLOW flow
>>>> AS AUTO CDC INTO target
>>>> FROM stream(cdc.users)
>>>> KEYS (userId)
>>>> APPLY AS DELETE WHEN operation = "DELETE"
>>>> SEQUENCE BY sequenceNum
>>>> COLUMNS * EXCEPT (operation, sequenceNum)
>>>> STORED AS SCD TYPE 2
>>>> TRACK HISTORY ON * EXCEPT (city);
>>>>
>>>> Please review the full SPIP for the technical details. Looking forward
>>>> to your feedback and discussion!
>>>>
>>>> Best regards,
>>>> Andreas
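[Editorial note: for readers unfamiliar with the SCD Type 2 behavior named in the example above, a minimal pure-Python sketch of history tracking. The `start_seq`/`end_seq` validity columns and function names are hypothetical; the SPIP document defines the actual column names and semantics, including TRACK HISTORY ON exclusions, which this sketch omits.]

```python
# Hypothetical SCD Type 2 model: each change closes the currently open
# version of a key (sets end_seq) and, for upserts, appends a new open
# version, so the full change history is retained.

def scd2_apply(history, event):
    key = event["userId"]
    # Close the currently open version of this key, if any.
    for row in history:
        if row["userId"] == key and row["end_seq"] is None:
            row["end_seq"] = event["sequenceNum"]
    if event["operation"] != "DELETE":
        row = {k: v for k, v in event.items()
               if k not in ("operation", "sequenceNum")}
        row.update(start_seq=event["sequenceNum"], end_seq=None)
        history.append(row)
    return history

def current_rows(history):
    # Only open versions (end_seq IS NULL) are "current".
    return [r for r in history if r["end_seq"] is None]
```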
