Re: [VOTE] SPIP: Auto CDC Support for Apache Spark

huaxin gao Mon, 06 Apr 2026 10:22:46 -0700

+1

On Mon, Apr 6, 2026 at 10:10 AM Anton Okolnychyi <[email protected]>
wrote:


> +1 (non-binding)
>
> On Sat, Apr 4, 2026 at 11:55, Gengliang Wang <[email protected]> wrote:
>
>> +1
>>
>> On Sat, Apr 4, 2026 at 10:17 AM Xiao Li <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Sat, Apr 4, 2026 at 09:45, vaquar khan <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> Regards,
>>>> Viquar Khan
>>>>
>>>> On Sat, 4 Apr 2026 at 11:14, Lisa N. Cao <[email protected]>
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> --
>>>>> LNC
>>>>>
>>>>> On Fri, Apr 3, 2026, 5:15 PM Shixiong Zhu <[email protected]> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 3, 2026 at 5:03 PM Mich Talebzadeh <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Dr Mich Talebzadeh,
>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>>>> Analytics
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>> On Fri, 3 Apr 2026 at 23:00, Andreas Neumann <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Spark devs,
>>>>>>>>
>>>>>>>> I'd like to call a vote on the SPIP: *Auto CDC Support for Apache Spark*
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> With the upcoming introduction of standardized CDC support
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will
>>>>>>>> soon have a unified way to produce change data feeds. However,
>>>>>>>> consuming these feeds and applying them to a target table remains
>>>>>>>> a significant challenge.
>>>>>>>>
>>>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>>>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>>>>>> complex MERGE logic. In distributed systems, these implementations
>>>>>>>> are frequently error-prone when handling deletions or out-of-order data.
>>>>>>>>
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>>>>>>> encapsulates the complex logic for SCD types and out-of-order data,
>>>>>>>> allowing data engineers to configure a declarative flow instead of
>>>>>>>> writing manual MERGE statements. This feature will be available in both
>>>>>>>> Python and SQL.
>>>>>>>>
>>>>>>>> Example SQL:
>>>>>>>>
>>>>>>>> -- Produce a change feed
>>>>>>>> CREATE STREAMING TABLE cdc_data.users AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>>>
>>>>>>>> -- Consume the change feed
>>>>>>>> CREATE FLOW flow
>>>>>>>> AS AUTO CDC INTO
>>>>>>>>   target
>>>>>>>> FROM stream(cdc_data.users)
>>>>>>>>   KEYS (userId)
>>>>>>>>   APPLY AS DELETE WHEN operation = "DELETE"
>>>>>>>>   SEQUENCE BY sequenceNum
>>>>>>>>   COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>>>>   STORED AS SCD TYPE 2
>>>>>>>>   TRACK HISTORY ON * EXCEPT (city);
>>>>>>>>
>>>>>>>>
>>>>>>>> *Relevant Links:*
>>>>>>>>
>>>>>>>>    - SPIP Document:
>>>>>>>>    https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>>>>    - Discussion Thread:
>>>>>>>>    https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7
>>>>>>>>    - JIRA: <https://issues.apache.org/jira/browse/SPARK-55668>
>>>>>>>>    https://issues.apache.org/jira/browse/SPARK-56249
>>>>>>>>
>>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>>
>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> [ ] +0
>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>
>>>>>>>> Cheers -Andreas
>>>>>>>>
>>>>>>>>
>>>>>>>>
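
For readers less familiar with the problem described in the quoted Motivation section, here is a minimal Python sketch of the bookkeeping an SCD Type 1 apply needs when change events arrive out of order or include deletes. It is only an illustration under stated assumptions, not the SPIP's design or implementation: the class `Scd1Target`, the in-memory dictionaries, the `name` column, and the `"UPSERT"` operation value are all hypothetical. The `userId`, `operation`, `sequenceNum`, and `city` column names are borrowed from the example SQL above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Scd1Target:
    """Hypothetical in-memory stand-in for a target table keyed by userId."""

    rows: Dict[Any, Dict[str, Any]] = field(default_factory=dict)
    # Highest sequence number applied per key. It is kept even after a
    # delete so that a late, older event cannot resurrect the row.
    last_seq: Dict[Any, int] = field(default_factory=dict)

    def apply(self, event: Dict[str, Any]) -> None:
        key = event["userId"]
        seq = event["sequenceNum"]
        prev = self.last_seq.get(key)
        if prev is not None and seq <= prev:
            return  # stale or duplicate event: ignore it
        self.last_seq[key] = seq
        if event["operation"] == "DELETE":
            self.rows.pop(key, None)
        else:
            # Store every column except the CDC metadata columns.
            self.rows[key] = {
                k: v
                for k, v in event.items()
                if k not in ("operation", "sequenceNum")
            }


# Events deliberately arrive out of order (assumed sample data).
events = [
    {"userId": 1, "name": "Ada", "city": "Paris", "operation": "UPSERT", "sequenceNum": 2},
    {"userId": 1, "name": "Ada", "city": "London", "operation": "UPSERT", "sequenceNum": 1},
    {"userId": 2, "name": "Bo", "city": "Oslo", "operation": "DELETE", "sequenceNum": 5},
    {"userId": 2, "name": "Bo", "city": "Oslo", "operation": "UPSERT", "sequenceNum": 4},
]

target = Scd1Target()
for event in events:
    target.apply(event)
```

In this sketch the late update for user 1 is dropped because a newer sequence number was already applied, and user 2 stays deleted because the delete's sequence number is remembered after the row is gone. A hand-written MERGE has to reproduce the same per-key sequence tracking, which is the kind of logic the proposal aims to encapsulate.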
