Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Štefan Miklošovič Mon, 12 Jan 2026 10:02:34 -0800

Hi James,

great stuff! I hope this set of patches is of the same quality as the
existing Analytics implementation, which has proved quite easy to
embed into 3rd party projects that use Analytics as a platform to get
data from. I am particularly interested in reusing the Avro codecs
(translation of InternalRow into its Avro representation) as that
opens up a lot of possibilities for further integrations.


I should take a look at the current state of that; it is definitely
worth a closer look as the next release of
Analytics is due.

Regards

On Mon, Jan 12, 2026 at 6:54 PM James Berragan <[email protected]> wrote:
>
> Hello all,
>
> Just to update everyone about CEP-44 after a period of radio silence. Last 
> week CASSSIDECAR-243 was merged in, which introduces end-to-end CDC 
> functionality to the Apache Cassandra Sidecar for the first time. While I'm 
> sure there will be more features and additions (CASSANDRA-21011 is already in 
> the works) this marks a major milestone in delivering CEP-44.
>
> The bulk of the work was completed across the Apache Cassandra Analytics and 
> Apache Cassandra Sidecar subprojects with the following PRs:
>
> https://github.com/apache/cassandra-analytics/pull/87
>
> https://github.com/apache/cassandra-analytics/pull/99
>
> https://github.com/apache/cassandra-analytics/pull/101
>
>
> https://github.com/apache/cassandra-sidecar/pull/294
>
> https://github.com/apache/cassandra-sidecar/pull/147
>
> https://github.com/apache/cassandra-sidecar/pull/158
>
> https://github.com/apache/cassandra-sidecar/pull/189
>
> https://github.com/apache/cassandra-sidecar/pull/193
>
>
> If you are interested in delving deeper into the code, these are the best 
> places to start:
>
> https://github.com/apache/cassandra-analytics/tree/trunk/cassandra-analytics-cdc
>  - implementation agnostic core CDC module.
>
> https://github.com/apache/cassandra-analytics/tree/trunk/cassandra-analytics-cdc-sidecar
>  - CDC implementation using the Sidecar APIs.
>
> https://github.com/apache/cassandra-analytics/tree/trunk/cassandra-analytics-cdc-codec
>  - Avro codecs and Kafka integration.
>
> https://github.com/apache/cassandra-sidecar/tree/trunk/server/src/main/java/org/apache/cassandra/sidecar/cdc
>  - CDC implementation built into the Apache Cassandra Sidecar.
>
>
> Many thanks and congratulations to Bernardo Botella Corbi, Jyothsna Konisa, 
> Yifan Cai (apologies if I'm forgetting anyone) for their contributions and 
> reviews.
>
> Thanks,
> James.
>
> On Tue, 15 Oct 2024 at 09:57, Bernardo Botella <[email protected]> 
> wrote:
>>
>> Hi everyone,
>>
>> I’d like to request one final round of feedback on CEP-44: Kafka 
>> integration for Cassandra CDC using Sidecar before calling a vote. We’ll 
>> leave it open for a few more days, and if nothing else comes in, we will 
>> call a vote.
>>
>> Bernardo
>>
>> On Oct 1, 2024, at 6:58 AM, James Berragan <[email protected]> wrote:
>>
>> It seems this has triggered some important discussions about CEP-1 and the 
>> Sidecar. Let's keep those in their respective threads and focus this 
>> conversation on CEP-44.
>>
>> Patrick, I think I missed your point "There is also little mention of where 
>> the increased resource load would be handled." - you're right, running CDC 
>> in the Sidecar implicitly means it uses additional resources in the C* 
>> cluster. This resource usage is proportional to the write throughput, so 
>> it's not suitable for use cases with very high write throughput, but our 
>> experience has been that for standard mixed workloads the overhead is 
>> minimal. The throttling built in safely handles burst workloads.
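
To make the burst-handling point concrete, here is a minimal token-bucket sketch. This is only an illustration of one common throttling technique; the thread does not say how the Sidecar throttle is actually implemented, and the names `TokenBucket`, `rate_per_sec`, `burst` and `try_acquire` are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical throttle: allows short bursts, caps the sustained rate."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = burst         # maximum tokens that can accumulate
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()

    def try_acquire(self, n=1):
        """Return True if n units of work may proceed now, False otherwise."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A caller would check `try_acquire()` before processing each batch of mutations and back off when it returns False, so a write burst drains the bucket instead of the node.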
>>
>> James.
>>
>> On Mon, 30 Sept 2024 at 14:22, Josh McKenzie <[email protected]> wrote:
>>>
>>> This is the type of hidden subproject that will get us into trouble with 
>>> the board/foundation.   I'm sure it's getting enough committer eyeballs, 
>>> and some PMC oversight, but maybe not enough.
>>>
>>> I don't agree with the qualifier of it as being hidden. It's definitely 
>>> lower traffic than the main project but there's movement on the JIRA here: 
>>> link.
>>>
>>> I assume the sidecar is going to take longer to reach a tipping point where 
>>> more people start contributing to it until it has compelling features 
>>> that'll incentivize folks running their own bespoke sidecars to migrate 
>>> over.
>>>
>>> Agree with all your points Jon; there's a lot to be done on it.
>>>
>>> CEP-1 is pretty much abandoned yeah. I think it'd be reasonable to close it 
>>> down and open up a new one w/active contributors + active shepherd and a 
>>> much more limited scope.
>>>
>>> On Mon, Sep 30, 2024, at 2:13 PM, Patrick McFadin wrote:
>>>
>>> I'm mentioning it because I was surprised and I feel like I generally have 
>>> a finger on the pulse of the project.
>>>
>>> I would love to talk about it more and get more community support and 
>>> interest.
>>>
>>> On Mon, Sep 30, 2024 at 11:01 AM Mick Semb Wever <[email protected]> wrote:
>>>
>>> Agree with Jon, Josh and Patrick here.
>>>
>>> This is the type of hidden subproject that will get us into trouble with 
>>> the board/foundation.   I'm sure it's getting enough committer eyeballs, 
>>> and some PMC oversight, but maybe not enough.  Addressing the more material 
>>> points that Jon mentions is the best way to deal with that IMHO.
>>>
>>>
>>>
>>> On Mon, 30 Sept 2024 at 20:37, Jon Haddad <[email protected]> wrote:
>>>
>>> I think it depends on what lens you're looking at the sidecar through.
>>>
>>> If you're actively working on it, and pulling it into your own infra, sure. 
>>>  It's a thing.
>>>
>>> If you're an outsider?  I have a hard time seeing it.
>>>
>>> - No documentation as to what it does
>>> - No releases
>>> - No build instructions
>>> - Trying to build using standard gradle commands fails [1]
>>> - Included configs don't work out of the box. [2][3]
>>> - CEP-1 looks abandoned
>>> - The primary reason right now to use it looks to be analytics library, 
>>> which doesn't work for most teams due to lack of vnode support [4]
>>>
>>> I think if you were to take a poll of 100 users outside this ML, I'd bet 
>>> almost every one would agree the sidecar isn't a thing yet, and that's 
>>> probably more important than if it's actually getting worked on.  I think 
>>> it has quite a ways to go before it looks to be more than an idea that's 
>>> incubating.
>>>
>>> [1] https://issues.apache.org/jira/browse/CASSANDRASC-120
>>> [2] https://issues.apache.org/jira/browse/CASSANDRASC-121
>>> [3] https://issues.apache.org/jira/browse/CASSANDRASC-122
>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-19594
>>>
>>>
>>> On Mon, Sep 30, 2024 at 11:14 AM Josh McKenzie <[email protected]> wrote:
>>>
>>>
>>> The CEP for the sidecar has stalled. The sidecar itself is very much alive 
>>> and a thing.
>>>
>>> CEP != artifact.
>>>
>>> We should definitely clean that up though.
>>>
>>> On Mon, Sep 30, 2024, at 10:59 AM, Dinesh Joshi wrote:
>>>
>>> Patrick, could you please elaborate? The Sidecar has been a thing for a 
>>> while now.
>>>
>>> On Mon, Sep 30, 2024 at 7:51 AM Patrick McFadin <[email protected]> wrote:
>>>
>>> I made the mistake of asking two things in one email.
>>>
>>> First thing I asked. Sidecar? Stalled CEP so why is this being talked about 
>>> like this is a thing?
>>>
>>> On Mon, Sep 30, 2024 at 7:21 AM Benedict <[email protected]> wrote:
>>>
>>>
>>> Sorry Bernardo, you may have misunderstood me. I don’t have any concerns, I 
>>> was suggesting a possible future scenario where CDC for Kafka via sidecar 
>>> is changed to use a hypothetical future topic subscription service provided 
>>> by C*. It was meant to show that this CEP may be easily decoupled from any 
>>> future evolution in this area.
>>>
>>>
>>> On 30 Sep 2024, at 14:58, Bernardo Botella <[email protected]> 
>>> wrote:
>>>
>>> Thanks everyone for the comments.
>>>
>>>
>>> Patrick:
>>> The proposal includes a “best effort” approach for deduplication (some 
>>> details can be found on the Digest class comments on the PR here 
>>> https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193
>>>  ). That alone won’t eliminate all the duplicates, but as Josh points out, 
>>> it moves the line to something way easier to handle for consumers, and it is 
>>> definitely in the direction we should aim for. Accord is definitely 
>>> something this contribution will benefit from, and it will move that line 
>>> even further.
>>>
>>> Benedict:
>>> If I understand it correctly, your concern is that Kafka is somewhat the 
>>> hardcoded option for a CDC stream being published? The proposal introduces 
>>> a concept of data sources and sinks 
>>> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks)
>>>  with Kafka being the first implemented data sink. That means that the actual 
>>> Kafka output should (will) be something pluggable.
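
One way to picture the pluggable sink idea is the sketch below. It is not taken from the CEP or the PRs; `Sink`, `InMemorySink`, `CdcPublisher` and the `keyspace.table` topic naming are all hypothetical names chosen for illustration. A Kafka sink would be one more subclass of `Sink`.

```python
from abc import ABC, abstractmethod


class Sink(ABC):
    """Hypothetical destination for serialized CDC events."""

    @abstractmethod
    def publish(self, topic: str, payload: bytes) -> None:
        ...


class InMemorySink(Sink):
    """Stand-in sink that collects events in a list (handy for tests)."""

    def __init__(self):
        self.records = []

    def publish(self, topic, payload):
        self.records.append((topic, payload))


class CdcPublisher:
    """Forwards change events to whichever sink it was configured with."""

    def __init__(self, sink: Sink):
        self.sink = sink

    def on_change(self, keyspace: str, table: str, payload: bytes) -> None:
        # Topic naming is an assumption made for this example only.
        self.sink.publish(f"{keyspace}.{table}", payload)
```

The publisher only depends on the `Sink` interface, so swapping Kafka for another destination would not touch the CDC side.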
>>>
>>>
>>>
>>> On Sep 30, 2024, at 5:43 AM, Josh McKenzie <[email protected]> wrote:
>>>
>>> I don't see much on how this would be handled other than "left to the end 
>>> user to figure out."
>>>
>>> My immediate thought when I read that was "Yes. But it's moving where we 
>>> draw the line of 'left to the end user to figure out' much further than it 
>>> was before".
>>>
>>> This should only be necessary in edge cases w/extended severe degraded 
>>> availability where you can't hit QUORUM w/this design. So we go from 
>>> "De-dupe literally everything o ye' user" to "de-dupe a small fraction of a 
>>> % of the time when things really go off the rails".
>>>
>>> It still leaves the burden of processing potential duplicates downstream, 
>>> so some complexity burden on the users remains if they have no tolerance 
>>> for processing duplicate messages, however the underlying machine resource 
>>> utilization (from "dedupe everything" to "dedupe a small % of things") is 
>>> pretty massively shifted by this design change. That, and using the hash of 
>>> the mutation the way the extended design does is something a downstream 
>>> consumer could also do on their side to ensure anything that came in past 
>>> the drop-off window wasn't already seen. So not too painful; certainly a 
>>> vast improvement over the status quo.
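
The consumer-side check described above could look roughly like the sketch below: hash each incoming message and remember the hashes for a window of the consumer's choosing. This is an assumption-laden illustration, not the Digest logic from the PR; `SeenWindow`, `window_seconds` and the use of SHA-256 over the raw payload are hypothetical choices.

```python
import hashlib
import time
from collections import OrderedDict


class SeenWindow:
    """Hypothetical consumer-side de-dupe: flags payloads seen within a window."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock            # injectable clock, useful for testing
        self._seen = OrderedDict()    # digest -> first-seen time, oldest first

    def _evict_expired(self, now):
        # Entries are in first-seen order, so stop at the first live one.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, payload: bytes) -> bool:
        """Return True if this payload was already seen inside the window."""
        now = self.clock()
        self._evict_expired(now)
        digest = hashlib.sha256(payload).digest()
        if digest in self._seen:
            return True
        self._seen[digest] = now
        return False
```

A consumer with no tolerance for duplicates would call `is_duplicate()` on each message before processing it, and could pick a window longer than the producer's drop-off window at the cost of more memory.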
>>>
>>> As to TCM and Accord: absolutely agree. I'd love to see a world where we 
>>> Accord everything and fire off CDC to subscribers from a coordinator 
>>> bypassing all this LSM-bastardized post-processing for CDC for instance. 
>>> That said, this is a functionality users needed back in... 2016? When we 
>>> first added CDC. So I think it's worth it to move on it now while retaining 
>>> architectural options to move to updated metadata and transactions as they 
>>> mature (obviously we'll lean on TCM since it's in 5.0 / trunk right now; 
>>> more applies to the accord bit).
>>>
>>> On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote:
>>>
>>>
>>> Yes, with accord it should be fairly easy to have reliable no-dupe log 
>>> streaming without an elected leader. Given the broad set of use cases, I 
>>> can imagine supporting some more native topic subscription API in C* rather 
>>> than requiring Kafka, so perhaps any integration of Kafka with the sidecar 
>>> can be considered a separate parallel effort, that might eventually 
>>> implement itself with this C* feature whenever it materialises?
>>>
>>>
>>> On 30 Sep 2024, at 03:42, Jeff Jirsa <[email protected]> wrote:
>>>
>>>
>>>
>>> Transactional metadata and Accord should make it MUCH easier to do 
>>> duplication avoiding CDC (and I was going to note that someone should 
>>> ensure that the interfaces exposed to the public are stable enough not to 
>>> change the published api once those exist)
>>>
>>>
>>>
>>> On Sep 29, 2024, at 7:04 PM, Patrick McFadin <[email protected]> wrote:
>>>
>>>
>>> As I was reviewing this, it occurred to me that it was talking about 
>>> Sidecar like it was a thing but that CEP has been stalled for quite some 
>>> time:  
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224
>>>
>>> If work on this is being done, should we get this official and wrapped up?
>>>
>>> On to the proposal...
>>>
>>> This has been a topic on the project for over 10 years now. I've seen 
>>> multiple goes at making this work and the issue that always turns out to 
>>> torpedo the project is handling dupes. To the point that they go from a 
>>> generalized Kafka producer engine to something specific to a particular use 
>>> case. I don't see much on how this would be handled other than "left to the 
>>> end user to figure out."
>>>
>>> There is also little mention of where the increased resource load would be 
>>> handled.
>>>
>>> This has been discussed many times before, but is it time to introduce the 
>>> concept of an elected leader for a token range for this type of operation? 
>>> It would eliminate a ton of problems that need to be managed when bridging C* 
>>> to a system like Kafka. Last time it was discussed in earnest was for 
>>> KIP-30: 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>>>
>>> Patrick
>>>
>>> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad <[email protected]> 
>>> wrote:
>>>
>>> Yes! I’m really looking forward to trying this out. The CEP looks really 
>>> well thought out. I think this will make CDC a lot more useful for a lot of 
>>> teams.
>>> Jon
>>>
>>>
>>> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie <[email protected]> wrote:
>>>
>>>
>>> Really excited to see this hit the ML James.
>>>
>>> As author of the base CDC (get your stones ready for throwing :D) and 
>>> someone moderately involved in the CEP here, definitely welcome any 
>>> questions. CDC is a thorny problem in a multi-replica distributed system 
>>> like this.
>>>
>>> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
>>>
>>> Hi everyone,
>>>
>>> Wiki: 
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar
>>>
>>> We would like to propose this CEP for adoption by the community.
>>>
>>> CDC is a common technique in databases but right now there is no 
>>> out-of-the-box solution to do this easily and at scale with Cassandra. Our 
>>> proposal is to build a fully-fledged solution into the Apache Cassandra 
>>> Sidecar. This comes with a number of benefits:
>>> - Sidecar is an official part of the existing Cassandra eco-system.
>>> - Sidecar runs co-located with Cassandra instances and so scales with the 
>>> cluster size.
>>> - Sidecar can access the underlying Cassandra database to store CDC 
>>> configuration and the CDC state in a special table.
>>> - Running in the Sidecar does not require additional external resources to 
>>> run.
>>>
>>> We anticipate that the core CDC module will be pluggable and re-usable; it is 
>>> available for review here: 
>>> https://github.com/apache/cassandra-analytics/pull/87. The remaining 
>>> Sidecar code will follow.
>>>
>>> As a reminder, please keep the discussion here on the dev list vs. in the 
>>> wiki, as we’ve found it easier to manage via email.
>>>
>>> Sincerely,
>>> James Berragan
>>> Bernardo Botella Corbi
>>> Yifan Cai
>>> Jyothsna Konisa
>>>
>>>
>>>
>>
