Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Bernardo Botella Tue, 15 Oct 2024 09:57:13 -0700

Hi everyone,

I’d like to make one final request for feedback on CEP-44: Kafka 
integration for Cassandra CDC using Sidecar before calling a vote. We’ll 
leave it open for a few more days, and if nothing else comes in, we will call 
a vote.


Bernardo

> On Oct 1, 2024, at 6:58 AM, James Berragan <jberra...@gmail.com> wrote:
> 
> It seems this has triggered some important discussions about CEP-1 and the 
> Sidecar. Let's keep those in their respective threads and focus this 
> conversation on CEP-44.
> 
> Patrick, I think I missed your point "There is also little mention of where 
> the increased resource load would be handled." - you're right, running CDC in 
> the Sidecar implicitly means it uses additional resources in the C* cluster. 
> This resource usage is proportional to the write throughput, so it's not 
> suitable for use cases with very high write throughput, but our experience 
> has been that for standard mixed workloads the overhead is minimal. The 
> built-in throttling safely handles burst workloads.
> 
> James.
> 
> On Mon, 30 Sept 2024 at 14:22, Josh McKenzie <jmcken...@apache.org> wrote:
>>> This is the type of hidden subproject that will get us into trouble with 
>>> the board/foundation.   I'm sure it's getting enough committer eyeballs, 
>>> and some PMC oversight, but maybe not enough.
>> I don't agree with the qualifier of it as being hidden. It's definitely 
>> lower traffic than the main project but there's movement on the JIRA here: 
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRASC%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC
>> 
>> I assume the sidecar won't reach a tipping point where more people start 
>> contributing to it until it has compelling features that'll incentivize 
>> folks running their own bespoke sidecars to migrate over.
>> 
>> Agree with all your points Jon; there's a lot to be done on it.
>> 
>> CEP-1 is pretty much abandoned yeah. I think it'd be reasonable to close it 
>> down and open up a new one w/active contributors + active shepherd and a 
>> much more limited scope.
>> 
>> On Mon, Sep 30, 2024, at 2:13 PM, Patrick McFadin wrote:
>>> I'm mentioning it because I was surprised and I feel like I generally have 
>>> a finger on the pulse of the project.
>>> 
>>> I would love to talk about it more and get more community support and 
>>> interest.
>>> 
>>> On Mon, Sep 30, 2024 at 11:01 AM Mick Semb Wever <m...@apache.org> wrote:
>>> Agree with Jon, Josh and Patrick here.
>>> 
>>> This is the type of hidden subproject that will get us into trouble with 
>>> the board/foundation.   I'm sure it's getting enough committer eyeballs, 
>>> and some PMC oversight, but maybe not enough.  Addressing the more material 
>>> points that Jon mentions is the best way to deal with that IMHO.
>>> 
>>> 
>>> 
>>> On Mon, 30 Sept 2024 at 20:37, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> I think it depends on what lens you're looking at the sidecar through.
>>> 
>>> If you're actively working on it, and pulling it into your own infra, sure. 
>>>  It's a thing. 
>>> 
>>> If you're an outsider?  I have a hard time seeing it.
>>> 
>>> - No documentation as to what it does
>>> - No releases
>>> - No build instructions
>>> - Trying to build using standard gradle commands fails [1]
>>> - Included configs don't work out of the box. [2][3]
>>> - CEP-1 looks abandoned
>>> - The primary reason right now to use it looks to be the analytics library, 
>>> which doesn't work for most teams due to lack of vnode support [4]
>>> 
>>> I think if you were to take a poll of 100 users outside this ML, I'd bet 
>>> almost every one would agree the sidecar isn't a thing yet, and that's 
>>> probably more important than if it's actually getting worked on.  I think 
>>> it has quite a ways to go before it looks to be more than an idea that's 
>>> incubating.
>>> 
>>> [1] https://issues.apache.org/jira/browse/CASSANDRASC-120
>>> [2] https://issues.apache.org/jira/browse/CASSANDRASC-121
>>> [3] https://issues.apache.org/jira/browse/CASSANDRASC-122
>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-19594
>>> 
>>> 
>>> On Mon, Sep 30, 2024 at 11:14 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> 
>>> The CEP for the sidecar has stalled. The sidecar itself is very much alive 
>>> and a thing.
>>> 
>>> CEP != artifact.
>>> 
>>> We should definitely clean that up though.
>>> 
>>> On Mon, Sep 30, 2024, at 10:59 AM, Dinesh Joshi wrote:
>>>> Patrick, could you please elaborate? The Sidecar has been a thing for a 
>>>> while now.
>>>> 
>>>> On Mon, Sep 30, 2024 at 7:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> I made the mistake of asking two things in one email. 
>>>> 
>>>> First thing I asked. Sidecar? Stalled CEP so why is this being talked 
>>>> about like this is a thing?
>>>> 
>>>> On Mon, Sep 30, 2024 at 7:21 AM Benedict <bened...@apache.org> wrote:
>>>> 
>>>> Sorry Bernardo, you may have misunderstood me. I don’t have any concerns, 
>>>> I was suggesting a possible future scenario where CDC for Kafka via 
>>>> sidecar is changed to use a hypothetical future topic subscription service 
>>>> provided by C*. It was meant to show that this CEP may be easily decoupled 
>>>> from any future evolution in this area. 
>>>> 
>>>> 
>>>>> On 30 Sep 2024, at 14:58, Bernardo Botella <conta...@bernardobotella.com> wrote:
>>>>> Thanks everyone for the comments.
>>>> 
>>>>> 
>>>>> Patrick:
>>>>> The proposal includes a “best effort” approach for deduplication (some 
>>>>> details can be found on the Digest class comments on the PR here 
>>>>> https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193
>>>>>  ). That alone won’t eliminate all the duplicates, but as Josh points 
>>>>> out, it moves the line to something much easier for consumers to handle, 
>>>>> and is definitely in the direction we should aim for. This contribution 
>>>>> will also benefit from Accord, which will move that line even further.
>>>>> 
>>>>> Benedict:
>>>>> If I understand it correctly, your concern is that Kafka is somewhat the 
>>>>> hardcoded option for a CDC stream being published? The proposal 
>>>>> introduces a concept of data sources and sinks 
>>>>> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks)
>>>>> , with Kafka being the first implemented data sink. That means that the 
>>>>> actual Kafka output should (will) be something pluggable.
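
To illustrate the sources-and-sinks idea, here is a minimal sketch of what a pluggable sink could look like. Every name in it (Sink, InMemorySink, SINKS, build_sink, forward) is hypothetical and invented for illustration; none is taken from the CEP or the PR, and a Kafka-backed sink would simply be one more entry in the registry.

```python
from typing import Callable, Dict, Iterable, List, Protocol, Tuple


class Sink(Protocol):
    """Hypothetical sink interface: anything that can accept a change record."""

    def publish(self, key: bytes, value: bytes) -> None: ...


class InMemorySink:
    """Stand-in sink that keeps records in a list (useful for tests)."""

    def __init__(self) -> None:
        self.records: List[Tuple[bytes, bytes]] = []

    def publish(self, key: bytes, value: bytes) -> None:
        self.records.append((key, value))


# Hypothetical registry mapping a configured name to a sink factory.
# A Kafka sink would be registered here alongside the in-memory one.
SINKS: Dict[str, Callable[[], Sink]] = {"memory": InMemorySink}


def build_sink(name: str) -> Sink:
    """Look up a sink by its configured name, failing loudly if unknown."""
    try:
        return SINKS[name]()
    except KeyError:
        raise ValueError(f"unknown sink: {name!r}") from None


def forward(changes: Iterable[Tuple[bytes, bytes]], sink: Sink) -> int:
    """Publish every (key, value) change to the sink; return how many were sent."""
    count = 0
    for key, value in changes:
        sink.publish(key, value)
        count += 1
    return count
```

The code that reads changes only depends on the publish method, so swapping the output target would be a configuration change rather than a code change.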
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 30, 2024, at 5:43 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>> 
>>>>>>> I don't see much on how this would be handled other than "left to the 
>>>>>>> end user to figure out." 
>>>>>> My immediate thought when I read that was "Yes. But it's moving where we 
>>>>>> draw the line of 'left to the end user to figure out' much further than 
>>>>>> it was before".
>>>>>> 
>>>>>> This should only be necessary in edge cases w/extended severe degraded 
>>>>>> availability where you can't hit QUORUM w/this design. So we go from 
>>>>>> "De-dupe literally everything o ye' user" to "de-dupe a small fraction 
>>>>>> of a % of the time when things really go off the rails".
>>>>>> 
>>>>>> It still leaves the burden of processing potential duplicates 
>>>>>> downstream, so some complexity burden on the users remains if they have 
>>>>>> no tolerance for processing duplicate messages, however the underlying 
>>>>>> machine resource utilization (from "dedupe everything" to "dedupe a 
>>>>>> small % of things") is pretty massively shifted by this design change. 
>>>>>> That, and using the hash of the mutation the way the extended design 
>>>>>> does is something a downstream consumer could also do on their side to 
>>>>>> ensure anything that came in past the drop-off window wasn't already 
>>>>>> seen. So not too painful; certainly a vast improvement over the status 
>>>>>> quo.
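
As a rough sketch of the consumer-side check described above, a downstream consumer could hash each incoming mutation and remember recent digests for a bounded window. This is an assumption-laden illustration, not the CEP's Digest class: the SeenDigests name, the choice of SHA-256, the window length, and the size cap are all hypothetical, and it assumes timestamps passed in are non-decreasing.

```python
import hashlib
from collections import OrderedDict


class SeenDigests:
    """Hypothetical time-windowed set of mutation digests for a consumer."""

    def __init__(self, window_seconds: float, max_entries: int = 100_000):
        self.window_seconds = window_seconds
        self.max_entries = max_entries
        # digest -> time first seen; insertion order is oldest first
        self._seen = OrderedDict()

    def _evict(self, now: float) -> None:
        # Drop entries that fell out of the window or exceed the size cap.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            expired = now - first_seen > self.window_seconds
            over_cap = len(self._seen) > self.max_entries
            if expired or over_cap:
                self._seen.popitem(last=False)
            else:
                break

    def is_duplicate(self, payload: bytes, now: float) -> bool:
        """Return True if this payload's hash was already seen in the window."""
        self._evict(now)
        digest = hashlib.sha256(payload).digest()
        if digest in self._seen:
            return True
        self._seen[digest] = now
        self._evict(now)  # enforce the size cap after inserting
        return False
```

A consumer would call is_duplicate(serialized_mutation, time.monotonic()) per message and skip processing when it returns True. Memory stays bounded by the window and the cap, at the cost of possibly re-processing a duplicate that arrives after its digest was evicted.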
>>>>>> 
>>>>>> As to TCM and Accord: absolutely agree. I'd love to see a world where we 
>>>>>> Accord everything and fire off CDC to subscribers from a coordinator 
>>>>>> bypassing all this LSM-bastardized post-processing for CDC for instance. 
>>>>>> That said, this is a functionality users needed back in... 2016? When we 
>>>>>> first added CDC. So I think it's worth it to move on it now while 
>>>>>> retaining architectural options to move to updated metadata and 
>>>>>> transactions as they mature (obviously we'll lean on TCM since it's in 
>>>>>> 5.0 / trunk right now; more applies to the accord bit).
>>>>>> 
>>>>>> On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote:
>>>>>>> 
>>>>>>> Yes, with accord it should be fairly easy to have reliable no-dupe log 
>>>>>>> streaming without an elected leader. Given the broad set of use cases, 
>>>>>>> I can imagine supporting some more native topic subscription API in C* 
>>>>>>> rather than requiring Kafka, so perhaps any integration of Kafka with 
>>>>>>> the sidecar can be considered a separate parallel effort, that might 
>>>>>>> eventually implement itself with this C* feature whenever it 
>>>>>>> materialises?
>>>>>>> 
>>>>>>> 
>>>>>>>> On 30 Sep 2024, at 03:42, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Transactional metadata and Accord should make it MUCH easier to do 
>>>>>>>> duplication-avoiding CDC (and I was going to note that someone should 
>>>>>>>> ensure that the interfaces exposed to the public are stable enough not 
>>>>>>>> to change the published API once those exist).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sep 29, 2024, at 7:04 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> As I was reviewing this, it occurred to me that it was talking about 
>>>>>>>>> Sidecar like it was a thing but that CEP has been stalled for quite 
>>>>>>>>> some time:  
>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224
>>>>>>>>> 
>>>>>>>>> If work on this is being done, should we get this official and 
>>>>>>>>> wrapped up?
>>>>>>>>> 
>>>>>>>>> On to the proposal...
>>>>>>>>> 
>>>>>>>>> This has been a topic on the project for over 10 years now. I've seen 
>>>>>>>>> multiple goes at making this work and the issue that always turns out 
>>>>>>>>> to torpedo the project is handling dupes. To the point that they go 
>>>>>>>>> from a generalized Kafka producer engine to something specific to a 
>>>>>>>>> particular use case. I don't see much on how this would be handled 
>>>>>>>>> other than "left to the end user to figure out." 
>>>>>>>>> 
>>>>>>>>> There is also little mention of where the increased resource load 
>>>>>>>>> would be handled. 
>>>>>>>>> 
>>>>>>>>> This has been discussed many times before, but is it time to 
>>>>>>>>> introduce the concept of an elected leader for a token range for this 
>>>>>>>>> type of operation? It would eliminate a ton of problems that need to 
>>>>>>>>> be managed when bridging C* to a system like Kafka. Last time it was 
>>>>>>>>> discussed in earnest was for KIP-30: 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> Patrick
>>>>>>>>> 
>>>>>>>>> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>>> Yes! I’m really looking forward to trying this out. The CEP looks 
>>>>>>>>> really well thought out. I think this will make CDC a lot more useful 
>>>>>>>>> for a lot of teams. 
>>>>>>>>> Jon
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> Really excited to see this hit the ML James.
>>>>>>>>> 
>>>>>>>>> As author of the base CDC (get your stones ready for throwing :D) and 
>>>>>>>>> someone moderately involved in the CEP here, definitely welcome any 
>>>>>>>>> questions. CDC is a thorny problem in a multi-replica distributed 
>>>>>>>>> system like this.
>>>>>>>>> 
>>>>>>>>> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> Wiki: 
>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar
>>>>>>>>>> 
>>>>>>>>>> We would like to propose this CEP for adoption by the community.
>>>>>>>>>> 
>>>>>>>>>> CDC is a common technique in databases but right now there is no 
>>>>>>>>>> out-of-the-box solution to do this easily and at scale with 
>>>>>>>>>> Cassandra. Our proposal is to build a fully-fledged solution into 
>>>>>>>>>> the Apache Cassandra Sidecar. This comes with a number of benefits:
>>>>>>>>>> - Sidecar is an official part of the existing Cassandra eco-system.
>>>>>>>>>> - Sidecar runs co-located with Cassandra instances and so scales 
>>>>>>>>>> with the cluster size.
>>>>>>>>>> - Sidecar can access the underlying Cassandra database to store CDC 
>>>>>>>>>> configuration and the CDC state in a special table.
>>>>>>>>>> - Running in the Sidecar does not require additional external 
>>>>>>>>>> resources to run.
>>>>>>>>>> 
>>>>>>>>>> We anticipate the core CDC module will be pluggable and re-usable; 
>>>>>>>>>> it is available for review here: 
>>>>>>>>>> https://github.com/apache/cassandra-analytics/pull/87. The remaining 
>>>>>>>>>> Sidecar code will follow.
>>>>>>>>>> 
>>>>>>>>>> As a reminder, please keep the discussion here on the dev list vs. 
>>>>>>>>>> in the wiki, as we’ve found it easier to manage via email.
>>>>>>>>>> 
>>>>>>>>>> Sincerely,
>>>>>>>>>> James Berragan
>>>>>>>>>> Bernardo Botella Corbi
>>>>>>>>>> Yifan Cai
>>>>>>>>>> Jyothsna Konisa
>>> 
>>
