I think there is some confusion - IIUC, recent proposals about CDC mostly
assume simple transformations ('stateless' in streaming-query terms).
CDC events will still be treated as append-only by the operators,
rather than the operators truly understanding how to retract on
deletion/update. Making CDC events truly reflected in the semantics of
the query would require a lot of effort, and probably a much bigger scope
than the current proposals about CDC.

On Sun, Mar 29, 2026 at 8:42 AM vaquar khan <[email protected]> wrote:

> Hi,
> Thanks for the SPIP. I fully support the goal: abstracting CDC merge logic
> is a huge win for the community. However, looking at the current Spark
> versions, there are significant architectural gaps between Databricks
> Lakeflow's proprietary implementation and OSS Spark.
>
> A few technical blockers need clarification before we move forward:
>
> - OSS Compatibility: Databricks documentation explicitly states that the
> AUTO CDC APIs are not supported by Apache Spark Declarative Pipelines
> <https://docs.databricks.com/gcp/en/ldp/cdc>.
>
> - Streaming MERGE: The proposed flow requires continuous upsert/delete
> semantics, but Dataset.mergeInto() currently does not support streaming
> queries. Does this SPIP introduce an entirely new execution path to bypass
> this restriction?
>
> - Tombstone Garbage Collection: Handling stream deletes safely requires
> state store tombstone retention (e.g., configuring
> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving data
> from resurrecting deleted keys. How will this be implemented natively in
> OSS Spark state stores?
>
> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and NULL
> sequencing values are explicitly not supported. How will the engine handle
> malformed or non-monotonic upstream sequences compared to our existing
> time-based watermarks?
>
> - Given the massive surface area (new SQL DDL, streaming MERGE paths, SCD
> Type 1/2 state logic, tombstone GC), a phased delivery plan would be very
> helpful. It would also clarify exactly which Lakeflow components are being
> contributed to open source versus what needs to be rebuilt from scratch.
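[Editor's note: to illustrate the tombstone-retention concern raised above, here is a plain-Python sketch (not Spark code; the event shape and field names are invented for illustration) of why a delete's sequence number must be remembered: without the tombstone, a late-arriving event with an older sequence number silently resurrects the deleted key.]

```python
# Hypothetical sketch (plain Python, not Spark) of delete tombstones.
# State maps key -> (last_seq, row_or_None); a tombstone is (seq, None).

def apply_event(state, event, keep_tombstones=True):
    """Apply one CDC event. With keep_tombstones=False, a DELETE drops
    the key entirely, losing the sequence high-water mark for that key."""
    key, seq, op = event["key"], event["seq"], event["op"]
    prev = state.get(key)
    if prev is not None and seq <= prev[0]:
        return state  # out of order: older than what was already applied
    if op == "DELETE":
        if keep_tombstones:
            state[key] = (seq, None)  # remember the delete and its sequence
        else:
            state.pop(key, None)      # forgetting the delete is the bug
    else:
        state[key] = (seq, event["row"])
    return state

events = [
    {"key": "u1", "seq": 1, "op": "UPSERT", "row": {"city": "NYC"}},
    {"key": "u1", "seq": 3, "op": "DELETE"},
    {"key": "u1", "seq": 2, "op": "UPSERT", "row": {"city": "SF"}},  # late
]

with_tombstones, without_tombstones = {}, {}
for e in events:
    apply_event(with_tombstones, e, keep_tombstones=True)
    apply_event(without_tombstones, e, keep_tombstones=False)

print(with_tombstones)     # {'u1': (3, None)} - u1 stays deleted
print(without_tombstones)  # {'u1': (2, {'city': 'SF'})} - wrongly resurrected
```

Retaining tombstones forever is unbounded state, which is why a GC threshold (as in the config mentioned above) is needed: a tombstone can only be dropped once events older than it can no longer arrive.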
>
> Best regards,
> Viquar Khan
>
> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>
>> unsubscribe
>>
>> Get Outlook for Android <https://aka.ms/AAb9ysg>
>> ------------------------------
>> *From:* Andreas Neumann <[email protected]>
>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>
>> Hi Vaibhav,
>>
>> The goal of this proposal is not to replace MERGE but to provide a simple
>> abstraction for the common use case of CDC.
>> MERGE itself is a very powerful operator, and there will always be use
>> cases outside of CDC that require MERGE.
>>
>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>
>> Cheers -Andreas
>>
>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> wrote:
>>
>> Hi Andreas,
>>
>> Thanks for sharing the SPIP. Does that mean the MERGE statement would be
>> deprecated? Also, I think there was a small typo; I have suggested a fix
>> in the doc.
>>
>> Regards,
>> Vaibhav
>>
>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>
>> +1
>>
>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>
>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'd like to start a discussion on a new SPIP to introduce Auto CDC
>> support to Apache Spark.
>>
>> - SPIP Document:
>>   https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>
>> Motivation
>>
>> With the upcoming introduction of standardized CDC support
>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>> have a unified way to produce change data feeds. However, consuming
>> these feeds and applying them to a target table remains a significant
>> challenge.
>>
>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type
>> 2 (tracking full change history) often require hand-crafted, complex
>> MERGE logic. In distributed systems, these implementations are
>> frequently error-prone when handling deletions or out-of-order data.
>>
>> Proposal
>>
>> This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates
>> the complex logic for SCD types and out-of-order data, allowing data
>> engineers to configure a declarative flow instead of writing manual MERGE
>> statements. This feature will be available in both Python and SQL.
>>
>> Example SQL:
>>
>> -- Produce a change feed
>> CREATE STREAMING TABLE cdc.users AS
>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>
>> -- Consume the change feed
>> CREATE FLOW flow
>> AS AUTO CDC INTO
>>   target
>> FROM stream(cdc.users)
>> KEYS (userId)
>> APPLY AS DELETE WHEN operation = "DELETE"
>> SEQUENCE BY sequenceNum
>> COLUMNS * EXCEPT (operation, sequenceNum)
>> STORED AS SCD TYPE 2
>> TRACK HISTORY ON * EXCEPT (city);
>>
>> Please review the full SPIP for the technical details. Looking forward to
>> your feedback and discussion!
>>
>> Best regards,
>>
>> Andreas
>>
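[Editor's note: for readers unfamiliar with the semantics the SQL example declares, here is a rough plain-Python model (my own illustration, not the proposed implementation; the feed schema is invented) of what an SCD Type 1 flow with KEYS, SEQUENCE BY, and APPLY AS DELETE WHEN amounts to: the latest row per key wins, ordered by the sequencing column, and matching delete events remove the key.]

```python
# Illustrative plain-Python model of SCD Type 1 semantics for clauses like
# KEYS (userId), SEQUENCE BY sequenceNum, APPLY AS DELETE WHEN operation = "DELETE".

def auto_cdc_scd1(events, key="userId", seq="sequenceNum", op="operation"):
    """Fold a CDC feed into a Type 1 target: one row per key, the event
    with the highest sequence value wins, and deletes remove the key."""
    target = {}  # key -> (last_seq, row_or_None)
    for e in sorted(events, key=lambda e: e[seq]):   # SEQUENCE BY
        k, s = e[key], e[seq]
        if k in target and s <= target[k][0]:
            continue                                 # stale, already superseded
        if e[op] == "DELETE":                        # APPLY AS DELETE WHEN
            target[k] = (s, None)                    # tombstone
        else:
            # COLUMNS * EXCEPT (operation, sequenceNum)
            row = {c: v for c, v in e.items() if c not in (op, seq)}
            target[k] = (s, row)
    return {k: row for k, (_, row) in target.items() if row is not None}

feed = [
    {"userId": 1, "sequenceNum": 1, "operation": "INSERT", "city": "NYC"},
    {"userId": 1, "sequenceNum": 2, "operation": "UPDATE", "city": "SF"},
    {"userId": 2, "sequenceNum": 1, "operation": "INSERT", "city": "LA"},
    {"userId": 2, "sequenceNum": 2, "operation": "DELETE", "city": None},
]
print(auto_cdc_scd1(feed))  # {1: {'userId': 1, 'city': 'SF'}}
```

SCD Type 2 (STORED AS SCD TYPE 2 in the example) differs in that, instead of overwriting, each change closes the current history row and opens a new one, with TRACK HISTORY controlling which column changes open a new row.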
