Hi Nimrod,

For your awareness, I have opened a discussion thread on the mailing list.
You can find it here:
https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
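As context for the thread below, here is a minimal, engine-agnostic sketch of the "apply changes" semantics such a CDC framework would implement (upsert/delete by primary key). The event shape ("op", "key", "row") is an illustrative assumption in the style of Debezium-like change events, not an existing or proposed Spark API:

```python
# Illustrative sketch only: fold a stream of CDC events into an in-memory
# snapshot keyed by primary key. The event shape is an assumption, not a
# real Spark API; a real sink would MERGE into an Iceberg/Delta table.

def apply_cdc(snapshot, events):
    """Apply insert/update/delete events to a {key: row} snapshot."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            snapshot[key] = event["row"]   # upsert semantics
        elif op == "delete":
            snapshot.pop(key, None)        # idempotent delete
        else:
            raise ValueError(f"unknown op: {op}")
    return snapshot

table = {1: {"id": 1, "name": "alice"}}
events = [
    {"op": "update", "key": 1, "row": {"id": 1, "name": "alicia"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "delete", "key": 1, "row": None},
]
print(apply_cdc(table, events))  # {2: {'id': 2, 'name': 'bob'}}
```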

On Sat, Feb 28, 2026 at 6:39 AM Ángel Álvarez Pascua <
[email protected]> wrote:

> I fully agree with your idea. A general, pluggable CDC framework in Spark
> would fill a real gap for integrating operational databases with lakehouse
> formats using Structured Streaming.
>
> I also believe it should integrate seamlessly with declarative pipelines,
> allowing users to declare intent (source, tables, sink, apply semantics)
> while Spark manages the underlying streaming jobs.
>
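A declarative pipeline of that shape might be expressed as a configuration like the following. This is purely a hypothetical sketch to make the intent concrete; none of these keys exist in Spark today:

```yaml
# Hypothetical sketch only -- no such configuration exists in Spark today.
# Declares intent: a pluggable CDC source, the tables to capture, a
# lakehouse sink, and the apply semantics; the engine would own the
# underlying streaming jobs.
pipeline:
  source:
    connector: postgresql          # pluggable CDC source
    host: db.example.com
    tables: [public.orders, public.customers]
  sink:
    format: iceberg                # or delta
    catalog: lakehouse
  apply:
    mode: upsert                   # merge by primary key
    delete-handling: hard-delete
```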
> On Sat, Feb 28, 2026, 15:20, Nimrod Ofek <[email protected]> wrote:
>
>> I think that one is only for Delta tables. I mean something more general,
>> with multiple pluggable sources, like Flink CDC: supporting CDC for SQL
>> Server, MySQL, PostgreSQL, Delta, and Iceberg for a start.
>> We would probably process the changes with something like Spark Structured
>> Streaming, supporting CDC for various data sources and general databases.
>>
>> While Iceberg and Delta can be read from various engines, other data
>> sources like MySQL, SQL Server, etc. can't. So to share such tables, one
>> needs an easy way to transform them into Iceberg/Delta for data lakes
>> (you can't read them from the operational database all the time).
>>
>> Thanks,
>> Nimrod
>>
>> On Sat, Feb 28, 2026, 15:54, Ángel Álvarez Pascua <
>> [email protected]> wrote:
>>
>>> You mean something like AutoCDC from Databricks?
>>> https://docs.databricks.com/aws/en/ldp/cdc
>>>
>>> On Sat, Feb 28, 2026, 10:47, Nimrod Ofek <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to start a discussion about implementing a Change Data
>>>> Capture (CDC) feature within Apache Spark, similar to the existing Flink
>>>> CDC functionality
>>>> <https://nightlies.apache.org/flink/flink-cdc-docs-master/docs/connectors/flink-sources/overview/>
>>>> .
>>>>
>>>> I believe integrating such a feature would significantly enhance
>>>> Spark's capabilities for real-time data integration and ETL processes. I
>>>> would appreciate the opportunity to discuss how we might approach this
>>>> proposal.
>>>>
>>>> Thank you for your time and consideration.
>>>>
>>>> Best regards,
>>>> Nimrod
>>>>
>>>>