Hi Mich,

I completely agree that changelogs are an important feature.
However, the objective of this SPIP is to ease the processing of a
changelog in order to produce a replica of the original source data.
Exposing the changelog on the source side is the subject of this earlier
SPIP: Change Data Capture (CDC) Support
<https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?tab=t.0>.

Cheers -Andreas


On Wed, Apr 1, 2026 at 4:53 AM Mich Talebzadeh <[email protected]>
wrote:

> I would suggest framing Auto CDC not only around MERGE/mutation, but also
> around *immutability and auditability*.
>
> In many real-world cases, the ability to reconstruct what happened
> (ordering of changes, intermediate states) is as important as computing the
> latest state. From this perspective, Auto CDC should treat immutable
> changelog data as a first-class abstraction, queryable via Spark SQL across
> storage backends.
> This enables:
>
>    - state reconstruction (point-in-time views)
>    - audit / forensic analysis
>    - validation of upstream processes
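
A point-in-time view of the kind described above can be sketched in plain
Python (not Spark; the event shape and field names are illustrative) as a
replay of the immutable changelog up to a chosen sequence number:

```python
def state_as_of(changelog, seq):
    """Replay an ordered changelog of (seqNum, op, key, value) events
    and return the keyed state as of the given sequence number."""
    state = {}
    for event_seq, op, key, value in sorted(changelog, key=lambda e: e[0]):
        if event_seq > seq:
            break  # ignore changes after the requested point in time
        if op == "DELETE":
            state.pop(key, None)
        else:  # INSERT and UPDATE both upsert the latest value
            state[key] = value
    return state

# Example: reconstruct an intermediate state, not just the latest one.
log = [
    (1, "INSERT", "u1", "alice"),
    (2, "INSERT", "u2", "bob"),
    (3, "UPDATE", "u1", "alice2"),
    (4, "DELETE", "u2", None),
]
assert state_as_of(log, 2) == {"u1": "alice", "u2": "bob"}
assert state_as_of(log, 4) == {"u1": "alice2"}
```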
>
> MERGE then becomes just one materialisation strategy, rather than the core
> abstraction.
>
> From a connector standpoint, exposing complete, ordered changelogs is more
> fundamental than supporting MERGE alone.
>
> HTH
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
> Analytics
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Tue, 31 Mar 2026 at 18:09, Anton Okolnychyi <[email protected]>
> wrote:
>
>> I think this is going to be a great feature if it naturally extends the
>> current capabilities of Spark and complements / builds on top of changelogs
>> and MERGE.
>>
>> I would be really worried otherwise.
>>
>> On Tue, Mar 31, 2026 at 6:02 AM Andreas Neumann <[email protected]> wrote:
>>
>>> Hi Mich,
>>>
>>> I agree that it is, in theory, possible to implement this for connectors
>>> that do not support MERGE. But it is our intention to rely on MERGE support
>>> for this feature. If there is large interest for it, we could follow up
>>> with an implementation that does not require MERGE, but I would want
>>> to evaluate first whether the additional complexity is justified.
>>>
>>> Note also that we will not require changelogs. That is a requirement for
>>> the source that produces the CDC feed, as addressed by this existing
>>> SPIP: Change Data Capture (CDC) Support
>>> <https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing>.
>>> I actually expect that many use cases will come from sources that do not
>>> support MERGE but do support change feeds.
>>>
>>> Cheers -Andreas
>>>
>>> On Mon, Mar 30, 2026 at 2:09 PM Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>>  Auto CDC may compile to a MERGE-like row-level plan, but the SPIP
>>>> should describe that as one implementation strategy, *not necessarily
>>>> the only one*. Connectors do not just need changelogs and MERGE; they
>>>> need changelog semantics on the read side and row-level mutation capability
>>>> on the write side, plus keys and usually sequencing.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>> Analytics
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>> On Mon, 30 Mar 2026 at 16:49, Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> Will auto CDC compile into or use MERGE under the hood? If yes, can we
>>>>> include a sketch of what that rewrite will look like in the SPIP? What
>>>>> exactly do connectors need to support to benefit from auto CDC? Just
>>>>> support changelogs and MERGE?
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Mon, Mar 30, 2026 at 07:44 Andreas Neumann <[email protected]> wrote:
>>>>>
>>>>>> Hi Vaquar,
>>>>>>
>>>>>> I responded to most of your comments on the document itself.
>>>>>> Additional comments are inline.
>>>>>>
>>>>>> Cheers -Andreas
>>>>>>
>>>>>> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Thanks for the SPIP. I fully support the goal: abstracting CDC merge
>>>>>>> logic is a huge win for the community. However, looking at the current
>>>>>>> Spark versions, there are significant architectural gaps between 
>>>>>>> Databricks
>>>>>>> Lakeflow's proprietary implementation and OSS Spark.
>>>>>>>
>>>>>>> A few technical blockers need clarification before we move forward:
>>>>>>>
>>>>>>> - OSS Compatibility: Databricks documentation explicitly states that
>>>>>>> the AUTO CDC APIs are not supported by Apache Spark Declarative
>>>>>>> Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>>>>>>
>>>>>> That will change with the implementation of this SPIP.
>>>>>>
>>>>>>
>>>>>>> - Streaming MERGE: The proposed flow requires continuous
>>>>>>> upsert/delete semantics, but Dataset.mergeInto() currently does not 
>>>>>>> support
>>>>>>> streaming queries. Does this SPIP introduce an entirely new execution
>>>>>>> path to bypass this restriction?
>>>>>>>
>>>>>> This works with foreachBatch.
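
For readers less familiar with the pattern: foreachBatch hands each
micro-batch to a function that can run an ordinary batch MERGE against the
target. A minimal sketch of the per-batch logic in plain Python (the dict
stands in for the target table; in Spark the body would be a MERGE INTO or
Dataset.mergeInto call, and the 'op' field name is illustrative):

```python
def merge_batch(target, batch):
    """Apply one micro-batch of change events to the target, keyed by 'id'.
    Mirrors MERGE semantics: matched + DELETE -> remove the row,
    matched -> update it, not matched -> insert it."""
    for row in batch:
        key = row["id"]
        if row["op"] == "DELETE":
            target.pop(key, None)
        else:
            # store the payload without the operation marker
            target[key] = {k: v for k, v in row.items() if k != "op"}
    return target

target = {}
merge_batch(target, [{"id": 1, "op": "INSERT", "name": "a"}])
merge_batch(target, [{"id": 1, "op": "UPDATE", "name": "b"},
                     {"id": 2, "op": "INSERT", "name": "c"}])
assert target == {1: {"id": 1, "name": "b"}, 2: {"id": 2, "name": "c"}}
```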
>>>>>>
>>>>>>
>>>>>>> - Tombstone Garbage Collection: Handling stream deletes safely
>>>>>>> requires state store tombstone retention (e.g., configuring
>>>>>>> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving 
>>>>>>> data
>>>>>>> from resurrecting deleted keys. How will this be implemented natively in
>>>>>>> OSS Spark state stores?
>>>>>>>
>>>>>> That's an interesting question. Tombstones could be modeled in the
>>>>>> state store, but we are thinking that they will be modeled as an explicit
>>>>>> output of the flow, either as records in the output table with a "deleted
>>>>>> at" marker, possibly with a view on top to project away these rows; or
>>>>>> in a separate output that contains only the tombstones. The exact design
>>>>>> is not finalized; that is part of the first phase of the project.
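
One way to picture the "deleted at" variant (purely illustrative; as noted,
the SPIP leaves the design open): deletes are kept as marked rows rather
than physical removals, so a late-arriving event with an older sequence
number can be recognized and dropped, and a view projects the markers away.

```python
def apply_event(table, event):
    """Soft-delete sketch: each row carries (seq, deleted_at). A DELETE
    writes a tombstone row instead of removing the key, so a late event
    with a lower sequence number than the tombstone is detected and ignored."""
    key, seq, op = event["key"], event["seq"], event["op"]
    current = table.get(key)
    if current is not None and seq <= current["seq"]:
        return  # late or duplicate event: the newer state wins
    if op == "DELETE":
        table[key] = {"seq": seq, "value": None, "deleted_at": seq}
    else:
        table[key] = {"seq": seq, "value": event.get("value"), "deleted_at": None}

def live_view(table):
    """View on top of the output that projects away tombstoned rows."""
    return {k: r["value"] for k, r in table.items() if r["deleted_at"] is None}

t = {}
apply_event(t, {"key": "u1", "seq": 1, "op": "INSERT", "value": "a"})
apply_event(t, {"key": "u1", "seq": 3, "op": "DELETE"})
apply_event(t, {"key": "u1", "seq": 2, "op": "UPDATE", "value": "b"})  # late
assert live_view(t) == {}  # the delete stands; the key is not resurrected
```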
>>>>>>
>>>>>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering where
>>>>>>> NULL sequencing values are explicitly not supported. How will the engine
>>>>>>> handle malformed or non-monotonic upstream sequences compared to our
>>>>>>> existing time-based watermarks?
>>>>>>>
>>>>>> I think malformed change events should, at least in the first
>>>>>> iteration, fail the stream. Otherwise there is a risk of writing 
>>>>>> incorrect
>>>>>> data.
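
The fail-fast behavior described above might look roughly like this (a
sketch only; in the real feature the check would live inside the streaming
plan, and the column name is the one from the SEQUENCE BY clause):

```python
def validate_sequence(rows, seq_col="sequenceNum"):
    """Reject batches whose sequencing column is malformed: NULL sequencing
    values are not supported by SEQUENCE BY, so a None fails the stream
    rather than risking an incorrect merge order."""
    for i, row in enumerate(rows):
        if row.get(seq_col) is None:
            raise ValueError(
                f"row {i}: NULL value in sequencing column '{seq_col}'")
    return rows

validate_sequence([{"sequenceNum": 1}, {"sequenceNum": 2}])  # passes
try:
    validate_sequence([{"sequenceNum": None}])
except ValueError as e:
    err = str(e)
assert "NULL" in err
```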
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> - Given the massive surface area (new SQL DDL, streaming MERGE
>>>>>>> paths, SCD Type 1/2 state logic, tombstone GC), a phased delivery plan
>>>>>>> would be very helpful. It would also clarify exactly which Lakeflow
>>>>>>> components are being contributed to open-source versus what needs to be
>>>>>>> rebuilt from scratch.
>>>>>>>
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Viquar Khan
>>>>>>>
>>>>>>> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>>>>>>>
>>>>>>>> unsubscribe
>>>>>>>>
>>>>>>>> Get Outlook for Android <https://aka.ms/AAb9ysg>
>>>>>>>> ------------------------------
>>>>>>>> *From:* Andreas Neumann <[email protected]>
>>>>>>>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>>>>>>>> *To:* [email protected] <[email protected]>
>>>>>>>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>>>>>>>
>>>>>>>> Hi Vaibhav,
>>>>>>>>
>>>>>>>> The goal of this proposal is not to replace MERGE but to provide a
>>>>>>>> simple abstraction for the common use case of CDC.
>>>>>>>> MERGE itself is a very powerful operator and there will always be
>>>>>>>> use cases outside of CDC that will require MERGE.
>>>>>>>>
>>>>>>>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>>>>>>>
>>>>>>>> Cheers -Andreas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Andreas,
>>>>>>>>
>>>>>>>> Thanks for sharing the SPIP. Does that mean the MERGE statement
>>>>>>>> would be deprecated? Also, I think there was a small typo, which I
>>>>>>>> have suggested a correction for in the doc.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vaibhav
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>>>
>>>>>>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I’d like to start a discussion on a new SPIP to introduce Auto CDC
>>>>>>>> support to Apache Spark.
>>>>>>>>
>>>>>>>>    - SPIP Document:
>>>>>>>>    https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>>>>    - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> With the upcoming introduction of standardized CDC support
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will
>>>>>>>> soon have a unified way to produce change data feeds. However,
>>>>>>>> consuming these feeds and applying them to a target table remains
>>>>>>>> a significant challenge.
>>>>>>>>
>>>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>>>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>>>>>> complex MERGE logic. In distributed systems, these implementations
>>>>>>>> are frequently error-prone when handling deletions or out-of-order 
>>>>>>>> data.
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>>>>>>> encapsulates the complex logic for SCD types and out-of-order data,
>>>>>>>> allowing data engineers to configure a declarative flow instead of 
>>>>>>>> writing
>>>>>>>> manual MERGE statements. This feature will be available in both Python
>>>>>>>> and SQL.
>>>>>>>> Example SQL:
>>>>>>>> -- Produce a change feed
>>>>>>>> CREATE STREAMING TABLE cdc.users AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>>>
>>>>>>>> -- Consume the change feed
>>>>>>>> CREATE FLOW flow
>>>>>>>> AS AUTO CDC INTO
>>>>>>>>   target
>>>>>>>> FROM stream(cdc_data.users)
>>>>>>>>   KEYS (userId)
>>>>>>>>   APPLY AS DELETE WHEN operation = "DELETE"
>>>>>>>>   SEQUENCE BY sequenceNum
>>>>>>>>   COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>>>>   STORED AS SCD TYPE 2
>>>>>>>>   TRACK HISTORY ON * EXCEPT (city);
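
To make the SCD Type 2 semantics of the example concrete, here is a
plain-Python sketch of how ordered change events become history rows with
start/end markers (the __START_AT/__END_AT column names are illustrative,
not part of the proposed spec):

```python
def scd2_history(events):
    """Turn ordered (seq, key, value) upserts into SCD Type 2 history rows.
    Each new value closes the previous row for the key (__END_AT = new seq)
    and opens a new one with __END_AT = None, i.e. currently active."""
    history = []  # finished and active rows, in arrival order
    active = {}   # key -> index of the currently open row in history
    for seq, key, value in sorted(events):
        if key in active:
            history[active[key]]["__END_AT"] = seq  # close previous version
        history.append({"key": key, "value": value,
                        "__START_AT": seq, "__END_AT": None})
        active[key] = len(history) - 1
    return history

rows = scd2_history([(1, "u1", "a"), (3, "u1", "b")])
assert rows == [
    {"key": "u1", "value": "a", "__START_AT": 1, "__END_AT": 3},
    {"key": "u1", "value": "b", "__START_AT": 3, "__END_AT": None},
]
```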
>>>>>>>>
>>>>>>>>
>>>>>>>> Please review the full SPIP for the technical details. Looking
>>>>>>>> forward to your feedback and discussion!
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>>
>>>>>>>>