I think this is going to be a great feature if it naturally extends the current capabilities of Spark and complements / builds on top of changelogs and MERGE.
I would be really worried otherwise.

On Tue, Mar 31, 2026 at 6:02 AM Andreas Neumann <[email protected]> wrote:

> Hi Mich,
>
> I agree that it is, in theory, possible to implement this for connectors
> that do not support MERGE. But it is our intention to rely on MERGE
> support for this feature. If there is large interest in it, we could
> follow up with an implementation that does not require MERGE, but I would
> want to evaluate first whether the additional complexity is justified.
>
> Note also that we will not require changelogs. That is a requirement for
> the source that produces the CDC feed, as addressed by this existing SPIP:
> SPIP: Change Data Capture (CDC) Support
> <https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing>.
> I actually expect that many use cases will come from sources that do not
> support MERGE but do support change feeds.
>
> Cheers -Andreas
>
> On Mon, Mar 30, 2026 at 2:09 PM Mich Talebzadeh <[email protected]>
> wrote:
>
>> Hello,
>>
>> Auto CDC may compile to a MERGE-like row-level plan, but the SPIP should
>> describe that as one implementation strategy, *not necessarily the only
>> one*. Connectors do not just need changelogs and MERGE; they need
>> changelog semantics on the read side and row-level mutation capability
>> on the write side, plus keys and usually sequencing.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>> Analytics
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> On Mon, 30 Mar 2026 at 16:49, Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> Will auto CDC compile into or use MERGE under the hood? If yes, can we
>>> include a sketch of what that rewrite will look like in the SPIP? What
>>> exactly do connectors need to support to benefit from auto CDC?
>>> Just support changelogs and MERGE?
>>>
>>> - Anton
>>>
>>> On Mon, Mar 30, 2026 at 07:44 Andreas Neumann <[email protected]>
>>> wrote:
>>>
>>>> Hi Vaquar,
>>>>
>>>> I responded to most of your comments on the document itself.
>>>> Additional comments inline.
>>>>
>>>> Cheers -Andreas
>>>>
>>>> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the SPIP. I fully support the goal: abstracting CDC merge
>>>>> logic is a huge win for the community. However, looking at the
>>>>> current Spark versions, there are significant architectural gaps
>>>>> between Databricks Lakeflow's proprietary implementation and OSS
>>>>> Spark.
>>>>>
>>>>> A few technical blockers need clarification before we move forward:
>>>>>
>>>>> - OSS Compatibility: Databricks documentation explicitly states that
>>>>> the AUTO CDC APIs are not supported by Apache Spark Declarative
>>>>> Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>>>>
>>>> That will change with the implementation of this SPIP.
>>>>
>>>>> - Streaming MERGE: The proposed flow requires continuous
>>>>> upsert/delete semantics, but Dataset.mergeInto() currently does not
>>>>> support streaming queries. Does this SPIP introduce an entirely new
>>>>> execution path to bypass this restriction?
>>>>>
>>>> This works with foreachBatch.
>>>>
>>>>> - Tombstone Garbage Collection: Handling stream deletes safely
>>>>> requires state store tombstone retention (e.g., configuring
>>>>> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving
>>>>> data from resurrecting deleted keys. How will this be implemented
>>>>> natively in OSS Spark state stores?
>>>>>
>>>> That's an interesting question.
>>>> Tombstones could be modeled in the state store, but we are thinking
>>>> that they will be modeled as an explicit output of the flow: either as
>>>> records in the output table with a "deleted at" marker, possibly with
>>>> a view on top to project away these rows, or as a separate output that
>>>> contains only the tombstones. The exact design is not finalized; that
>>>> is part of the first phase of the project.
>>>>
>>>>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and
>>>>> NULL sequencing values are explicitly not supported. How will the
>>>>> engine handle malformed or non-monotonic upstream sequences compared
>>>>> to our existing time-based watermarks?
>>>>>
>>>> I think malformed change events should, at least in the first
>>>> iteration, fail the stream. Otherwise there is a risk of writing
>>>> incorrect data.
>>>>
>>>>> - Given the massive surface area (new SQL DDL, streaming MERGE paths,
>>>>> SCD Type 1/2 state logic, tombstone GC), a phased delivery plan would
>>>>> be very helpful. It would also clarify exactly which Lakeflow
>>>>> components are being contributed to open source versus what needs to
>>>>> be rebuilt from scratch.
>>>>>
>>>>> Best regards,
>>>>> Vaquar Khan
>>>>>
>>>>>> *From:* Andreas Neumann <[email protected]>
>>>>>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>>>>>> *To:* [email protected] <[email protected]>
>>>>>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>>>>>
>>>>>> Hi Vaibhav,
>>>>>>
>>>>>> The goal of this proposal is not to replace MERGE but to provide a
>>>>>> simple abstraction for the common use case of CDC. MERGE itself is a
>>>>>> very powerful operator, and there will always be use cases outside
>>>>>> of CDC that will require MERGE.
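The tombstone approach Andreas describes earlier in the thread (delete events kept as rows with a "deleted at" marker, hidden behind a view) could be sketched roughly as follows. This is an illustration only, since the design is explicitly not finalized; every name here (write_change, current_view, deleted_at, userId) is hypothetical and not part of the SPIP.

```python
# Hypothetical sketch of the "deleted at" tombstone idea from the thread.
# A delete is written as an upsert that sets a deleted_at marker instead of
# physically removing the row; a view on top hides tombstoned rows.
# All names are illustrative, not part of the proposal.

def write_change(table, row, deleted_at=None):
    # Upsert by key; a delete is just an upsert with deleted_at set, so the
    # key's tombstone remains visible to the flow until it is eventually
    # garbage-collected by a separate retention process.
    table[row["userId"]] = {**row, "deleted_at": deleted_at}

def current_view(table):
    # The "view on top": consumers only see rows that are not tombstoned.
    return [row for row in table.values() if row["deleted_at"] is None]
```

The alternative Andreas mentions, a separate output containing only the tombstones, would replace the marker column with a second table holding the deleted keys.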
>>>>>>
>>>>>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>>>>>
>>>>>> Cheers -Andreas
>>>>>>
>>>>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Andreas,
>>>>>>
>>>>>> Thanks for sharing the SPIP. Does that mean the MERGE statement
>>>>>> would be deprecated? Also, I think there was a small typo, which I
>>>>>> have suggested in the doc.
>>>>>>
>>>>>> Regards,
>>>>>> Vaibhav
>>>>>>
>>>>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>
>>>>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I’d like to start a discussion on a new SPIP to introduce Auto CDC
>>>>>> support to Apache Spark.
>>>>>>
>>>>>> - SPIP Document:
>>>>>> https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>
>>>>>> Motivation
>>>>>>
>>>>>> With the upcoming introduction of standardized CDC support
>>>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>>>>> have a unified way to produce change data feeds. However, consuming
>>>>>> these feeds and applying them to a target table remains a
>>>>>> significant challenge.
>>>>>>
>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>>>> complex MERGE logic. In distributed systems, these implementations
>>>>>> are frequently error-prone when handling deletions or out-of-order
>>>>>> data.
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark.
>>>>>> It encapsulates the complex logic for SCD types and out-of-order
>>>>>> data, allowing data engineers to configure a declarative flow
>>>>>> instead of writing manual MERGE statements. This feature will be
>>>>>> available in both Python and SQL.
>>>>>>
>>>>>> Example SQL:
>>>>>>
>>>>>> -- Produce a change feed
>>>>>> CREATE STREAMING TABLE cdc.users AS
>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>
>>>>>> -- Consume the change feed
>>>>>> CREATE FLOW flow
>>>>>> AS AUTO CDC INTO target
>>>>>> FROM stream(cdc_data.users)
>>>>>> KEYS (userId)
>>>>>> APPLY AS DELETE WHEN operation = "DELETE"
>>>>>> SEQUENCE BY sequenceNum
>>>>>> COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>> STORED AS SCD TYPE 2
>>>>>> TRACK HISTORY ON * EXCEPT (city);
>>>>>>
>>>>>> Please review the full SPIP for the technical details. Looking
>>>>>> forward to your feedback and discussion!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Andreas
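To make the semantics discussed in this thread concrete: Anton asked what the MERGE rewrite looks like, and Andreas noted that streaming MERGE works via foreachBatch. Setting Spark aside entirely, the core behavior of KEYS, SEQUENCE BY, APPLY AS DELETE, and COLUMNS * EXCEPT for SCD Type 1 (a 1:1 replica) can be sketched in plain Python. This is an illustration of the intended semantics under my reading of the SPIP example, not the actual implementation; column names mirror the SQL example above.

```python
# Pure-Python sketch of AUTO CDC semantics for SCD Type 1 (a 1:1 replica).
# Illustration only: in Spark, each microbatch of change events would instead
# compile to a MERGE against the target table, not run as Python.

def apply_changes_scd1(events, key="userId", seq="sequenceNum", op="operation"):
    """Apply a batch of change events; return the target table keyed by `key`."""
    latest = {}
    for event in events:
        k = event[key]
        # SEQUENCE BY: only a strictly newer sequence number may replace the
        # current row, so late (out-of-order) events cannot clobber fresher state.
        if k not in latest or event[seq] > latest[k][seq]:
            latest[k] = event
    # APPLY AS DELETE WHEN operation = "DELETE": a key whose latest event is a
    # delete is absent from the target table.
    # COLUMNS * EXCEPT (operation, sequenceNum): bookkeeping columns are
    # projected away from the stored row.
    return {
        k: {c: v for c, v in event.items() if c not in (op, seq)}
        for k, event in latest.items()
        if event[op] != "DELETE"
    }
```

In Spark terms, each microbatch of this logic would presumably be expressed as a MERGE keyed on userId (e.g., Dataset.mergeInto inside foreachBatch, as mentioned above), with the sequence-number comparison folded into the match condition; SCD Type 2 would additionally close out the previous row version instead of overwriting it.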
