I think this is going to be a great feature if it naturally extends the current capabilities of Spark and complements / builds on top of changelogs and MERGE.
I would be really worried otherwise.

On Tue, Mar 31, 2026 at 6:02 AM Andreas Neumann <[email protected]> wrote:

> Hi Mich,
>
> I agree that it is, in theory, possible to implement this for connectors
> that do not support MERGE. But it is our intention to rely on MERGE
> support for this feature. If there is large interest in it, we could
> follow up with an implementation that does not require MERGE, but I would
> want to evaluate first whether the additional complexity is justified.
>
> Note also that we will not require changelogs. That is a requirement for
> the source that produces the CDC feed, as addressed by this existing SPIP:
> SPIP: Change Data Capture (CDC) Support
> <https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing>.
> I actually expect that many use cases will come from sources that do not
> support MERGE but do support change feeds.
>
> Cheers -Andreas
>
> On Mon, Mar 30, 2026 at 2:09 PM Mich Talebzadeh <[email protected]>
> wrote:
>
>> Hello,
>>
>> Auto CDC may compile to a MERGE-like row-level plan, but the SPIP should
>> describe that as one implementation strategy, *not necessarily the only
>> one*. Connectors do not just need changelogs and MERGE; they need
>> changelog semantics on the read side and row-level mutation capability
>> on the write side, plus keys and usually sequencing.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>> Analytics
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> On Mon, 30 Mar 2026 at 16:49, Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> Will auto CDC compile into or use MERGE under the hood? If yes, can we
>>> include a sketch of what that rewrite will look like in the SPIP? What
>>> exactly do connectors need to support to benefit from auto CDC?
>>> Just support changelogs and MERGE?
>>>
>>> - Anton
>>>
>>> On Mon, Mar 30, 2026 at 07:44 Andreas Neumann <[email protected]>
>>> wrote:
>>>
>>>> Hi Vaquar,
>>>>
>>>> I responded to most of your comments on the document itself.
>>>> Additional comments inline.
>>>>
>>>> Cheers -Andreas
>>>>
>>>> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the SPIP. I fully support the goal: abstracting CDC merge
>>>>> logic is a huge win for the community. However, looking at the
>>>>> current Spark versions, there are significant architectural gaps
>>>>> between Databricks Lakeflow's proprietary implementation and OSS
>>>>> Spark.
>>>>>
>>>>> A few technical blockers need clarification before we move forward:
>>>>>
>>>>> - OSS Compatibility: Databricks documentation explicitly states that
>>>>> the AUTO CDC APIs are not supported by Apache Spark Declarative
>>>>> Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>>>>
>>>> That will change with the implementation of this SPIP.
>>>>
>>>>> - Streaming MERGE: The proposed flow requires continuous
>>>>> upsert/delete semantics, but Dataset.mergeInto() currently does not
>>>>> support streaming queries. Does this SPIP introduce an entirely new
>>>>> execution path to bypass this restriction?
>>>>>
>>>> This works with foreachBatch.
>>>>
>>>>> - Tombstone Garbage Collection: Handling stream deletes safely
>>>>> requires state store tombstone retention (e.g., configuring
>>>>> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving
>>>>> data from resurrecting deleted keys. How will this be implemented
>>>>> natively in OSS Spark state stores?
>>>>>
>>>> That's an interesting question.
>>>> Tombstones could be modeled in the state store, but we are thinking
>>>> that they will be modeled as an explicit output of the flow: either as
>>>> records in the output table with a "deleted at" marker, possibly with
>>>> a view on top to project away these rows, or as a separate output that
>>>> contains only the tombstones. The exact design is not finalized; that
>>>> is part of the first phase of the project.
>>>>
>>>>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and
>>>>> NULL sequencing values are explicitly not supported. How will the
>>>>> engine handle malformed or non-monotonic upstream sequences compared
>>>>> to our existing time-based watermarks?
>>>>>
>>>> I think malformed change events should, at least in the first
>>>> iteration, fail the stream. Otherwise there is a risk of writing
>>>> incorrect data.
>>>>
>>>>> - Given the massive surface area (new SQL DDL, streaming MERGE paths,
>>>>> SCD Type 1/2 state logic, tombstone GC), a phased delivery plan would
>>>>> be very helpful. It would also clarify exactly which Lakeflow
>>>>> components are being contributed to open source versus what needs to
>>>>> be rebuilt from scratch.
>>>>>
>>>>> Best regards,
>>>>> Vaquar Khan
>>>>>
>>>>>> *From:* Andreas Neumann <[email protected]>
>>>>>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>>>>>> *To:* [email protected] <[email protected]>
>>>>>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>>>>>
>>>>>> Hi Vaibhav,
>>>>>>
>>>>>> The goal of this proposal is not to replace MERGE but to provide a
>>>>>> simple abstraction for the common use case of CDC. MERGE itself is a
>>>>>> very powerful operator, and there will always be use cases outside
>>>>>> of CDC that will require MERGE.
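The tombstone approach Andreas describes earlier in the thread (delete events kept as rows with a "deleted at" marker, hidden behind a view) could be sketched roughly as follows. This is an illustration only, since the design is explicitly not finalized; every name here (write_change, current_view, deleted_at, userId) is hypothetical and not part of the SPIP.

```python
# Hypothetical sketch of the "deleted at" tombstone idea from the thread.
# A delete is written as an upsert that sets a deleted_at marker instead of
# physically removing the row; a view on top hides tombstoned rows.
# All names are illustrative, not part of the proposal.

def write_change(table, row, deleted_at=None):
    # Upsert by key; a delete is just an upsert with deleted_at set, so the
    # key's tombstone remains visible to the flow until it is eventually
    # garbage-collected by a separate retention process.
    table[row["userId"]] = {**row, "deleted_at": deleted_at}

def current_view(table):
    # The "view on top": consumers only see rows that are not tombstoned.
    return [row for row in table.values() if row["deleted_at"] is None]
```

The alternative Andreas mentions, a separate output containing only the tombstones, would replace the marker column with a second table holding the deleted keys.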
>>>>>>
>>>>>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>>>>>
>>>>>> Cheers -Andreas
>>>>>>
>>>>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Andreas,
>>>>>>
>>>>>> Thanks for sharing the SPIP. Does that mean the MERGE statement
>>>>>> would be deprecated? Also, I think there was a small typo, which I
>>>>>> have suggested in the doc.
>>>>>>
>>>>>> Regards,
>>>>>> Vaibhav
>>>>>>
>>>>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>
>>>>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I’d like to start a discussion on a new SPIP to introduce Auto CDC
>>>>>> support to Apache Spark.
>>>>>>
>>>>>> - SPIP Document:
>>>>>> https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>
>>>>>> Motivation
>>>>>>
>>>>>> With the upcoming introduction of standardized CDC support
>>>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>>>>> have a unified way to produce change data feeds. However, consuming
>>>>>> these feeds and applying them to a target table remains a
>>>>>> significant challenge.
>>>>>>
>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>>>> complex MERGE logic. In distributed systems, these implementations
>>>>>> are frequently error-prone when handling deletions or out-of-order
>>>>>> data.
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark.
>>>>>> It encapsulates the complex logic for SCD types and out-of-order
>>>>>> data, allowing data engineers to configure a declarative flow
>>>>>> instead of writing manual MERGE statements. This feature will be
>>>>>> available in both Python and SQL.
>>>>>>
>>>>>> Example SQL:
>>>>>>
>>>>>> -- Produce a change feed
>>>>>> CREATE STREAMING TABLE cdc.users AS
>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>
>>>>>> -- Consume the change feed
>>>>>> CREATE FLOW flow
>>>>>> AS AUTO CDC INTO target
>>>>>> FROM stream(cdc_data.users)
>>>>>> KEYS (userId)
>>>>>> APPLY AS DELETE WHEN operation = "DELETE"
>>>>>> SEQUENCE BY sequenceNum
>>>>>> COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>> STORED AS SCD TYPE 2
>>>>>> TRACK HISTORY ON * EXCEPT (city);
>>>>>>
>>>>>> Please review the full SPIP for the technical details. Looking
>>>>>> forward to your feedback and discussion!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Andreas
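To make the semantics discussed in this thread concrete: Anton asked what the MERGE rewrite looks like, and Andreas noted that streaming MERGE works via foreachBatch. Setting Spark aside entirely, the core behavior of KEYS, SEQUENCE BY, APPLY AS DELETE, and COLUMNS * EXCEPT for SCD Type 1 (a 1:1 replica) can be sketched in plain Python. This is an illustration of the intended semantics under my reading of the SPIP example, not the actual implementation; column names mirror the SQL example above.

```python
# Pure-Python sketch of AUTO CDC semantics for SCD Type 1 (a 1:1 replica).
# Illustration only: in Spark, each microbatch of change events would instead
# compile to a MERGE against the target table, not run as Python.

def apply_changes_scd1(events, key="userId", seq="sequenceNum", op="operation"):
    """Apply a batch of change events; return the target table keyed by `key`."""
    latest = {}
    for event in events:
        k = event[key]
        # SEQUENCE BY: only a strictly newer sequence number may replace the
        # current row, so late (out-of-order) events cannot clobber fresher state.
        if k not in latest or event[seq] > latest[k][seq]:
            latest[k] = event
    # APPLY AS DELETE WHEN operation = "DELETE": a key whose latest event is a
    # delete is absent from the target table.
    # COLUMNS * EXCEPT (operation, sequenceNum): bookkeeping columns are
    # projected away from the stored row.
    return {
        k: {c: v for c, v in event.items() if c not in (op, seq)}
        for k, event in latest.items()
        if event[op] != "DELETE"
    }
```

In Spark terms, each microbatch of this logic would presumably be expressed as a MERGE keyed on userId (e.g., Dataset.mergeInto inside foreachBatch, as mentioned above), with the sequence-number comparison folded into the match condition; SCD Type 2 would additionally close out the previous row version instead of overwriting it.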
