Re: [VOTE] SPIP: Auto CDC Support for Apache Spark

Gengliang Wang Sat, 04 Apr 2026 12:02:23 -0700

+1

On Sat, Apr 4, 2026 at 10:17 AM Xiao Li <[email protected]> wrote:


> +1
>
> On Sat, Apr 4, 2026 at 09:45, vaquar khan <[email protected]> wrote:
>
>> +1
>>
>> Regards,
>> Viquar Khan
>>
>> On Sat, 4 Apr 2026 at 11:14, Lisa N. Cao <[email protected]>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> --
>>> LNC
>>>
>>> On Fri, Apr 3, 2026, 5:15 PM Shixiong Zhu <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>>
>>>> On Fri, Apr 3, 2026 at 5:03 PM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>> Analytics
>>>>>
>>>>>    view my LinkedIn profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 3 Apr 2026 at 23:00, Andreas Neumann <[email protected]> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I'd like to call a vote on the SPIP: *Auto CDC Support for Apache
>>>>>> Spark*
>>>>>>
>>>>>> Motivation
>>>>>>
>>>>>> With the upcoming introduction of standardized CDC support
>>>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>>>>> have a unified way to produce change data feeds. However, consuming these
>>>>>> feeds and applying them to a target table remains a significant
>>>>>> challenge.
>>>>>>
>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>>>> complex MERGE logic. In distributed systems, these implementations
>>>>>> are frequently error-prone when handling deletions or out-of-order data.
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>>>>> encapsulates the complex logic for SCD types and out-of-order data,
>>>>>> allowing data engineers to configure a declarative flow instead of
>>>>>> writing manual MERGE statements. This feature will be available in both
>>>>>> Python and SQL.
>>>>>>
>>>>>> Example SQL:
>>>>>>
>>>>>> -- Produce a change feed
>>>>>> CREATE STREAMING TABLE cdc.users AS
>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>
>>>>>> -- Consume the change feed
>>>>>> CREATE FLOW flow
>>>>>> AS AUTO CDC INTO
>>>>>>   target
>>>>>> FROM stream(cdc_data.users)
>>>>>>   KEYS (userId)
>>>>>>   APPLY AS DELETE WHEN operation = "DELETE"
>>>>>>   SEQUENCE BY sequenceNum
>>>>>>   COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>>   STORED AS SCD TYPE 2
>>>>>>   TRACK HISTORY ON * EXCEPT (city);
>>>>>>
>>>>>>
>>>>>> *Relevant Links:*
>>>>>>
>>>>>>    - SPIP Document:
>>>>>>      https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>>    - Discussion Thread:
>>>>>>      https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7
>>>>>>    - JIRA: <https://issues.apache.org/jira/browse/SPARK-55668>
>>>>>>      https://issues.apache.org/jira/browse/SPARK-56249
>>>>>>
>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>> Cheers -Andreas
>>>>>>
>>>>>>
>>>>>>
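
For readers new to the pattern, here is a minimal Python sketch of the SCD Type 1 behaviour the quoted proposal describes: apply each change by key, use the sequence column to discard late or duplicate events, and keep a tombstone for deletes so that a late insert cannot resurrect a deleted row. It is an illustration only, not the SPIP's implementation. `ChangeEvent`, `apply_scd1`, and `visible_rows` are hypothetical names, and integer sequence numbers are an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional, Tuple

Row = Dict[str, Any]
# key -> (highest sequence applied so far, current row, or None if deleted)
State = Dict[Any, Tuple[int, Optional[Row]]]


@dataclass(frozen=True)
class ChangeEvent:
    """One record of a change feed (hypothetical shape, for illustration)."""
    key: Any
    sequence: int          # assumed to be a comparable integer
    operation: str         # "INSERT", "UPDATE", or "DELETE"
    row: Optional[Row] = None


def apply_scd1(state: State, events: Iterable[ChangeEvent]) -> State:
    """Apply change events to `state`, keeping only the latest version per key."""
    for ev in events:
        current = state.get(ev.key)
        # Skip events that are older than (or equal to) what was already applied.
        if current is not None and ev.sequence <= current[0]:
            continue
        # A delete is stored as a tombstone (row=None) so its sequence is remembered.
        new_row = None if ev.operation == "DELETE" else ev.row
        state[ev.key] = (ev.sequence, new_row)
    return state


def visible_rows(state: State) -> Dict[Any, Row]:
    """Return the rows a reader of the target table would see (no tombstones)."""
    return {key: row for key, (_, row) in state.items() if row is not None}


if __name__ == "__main__":
    events = [
        ChangeEvent(1, 2, "UPDATE", {"userId": 1, "city": "Berlin"}),
        ChangeEvent(1, 1, "INSERT", {"userId": 1, "city": "Paris"}),  # arrives late
        ChangeEvent(2, 5, "DELETE"),
        ChangeEvent(2, 4, "INSERT", {"userId": 2, "city": "Rome"}),   # arrives late
    ]
    # Only user 1 remains visible, with the Berlin row.
    print(visible_rows(apply_scd1({}, events)))
```

Hand-written MERGE logic has to reproduce the same sequence comparison and tombstone handling, which is the part a declarative flow would take over.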
