Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Josh McKenzie Mon, 30 Sep 2024 14:27:48 -0700

> This is the type of hidden subproject that will get us into trouble with the 
> board/foundation.   I'm sure it's getting enough committer eyeballs, and some 
> PMC oversight, but maybe not enough.
I don't agree with the qualifier of it as being hidden. It's definitely lower 
traffic than the main project but there's movement on the JIRA here: link 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRASC%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC>.


I assume the sidecar won't reach a tipping point where more people start 
contributing to it until it has compelling features that'll incentivize 
folks running their own bespoke sidecars to migrate over.

Agree with all your points Jon; there's a lot to be done on it.

CEP-1 is pretty much abandoned yeah. I think it'd be reasonable to close it 
down and open up a new one w/active contributors + active shepherd and a much 
more limited scope.

On Mon, Sep 30, 2024, at 2:13 PM, Patrick McFadin wrote:
> I'm mentioning it because I was surprised and I feel like I generally have a 
> finger on the pulse of the project.
> 
> I would love to talk about it more and get more community support and 
> interest.
> 
> On Mon, Sep 30, 2024 at 11:01 AM Mick Semb Wever <[email protected]> wrote:
>> Agree with Jon, Josh and Patrick here.
>> 
>> This is the type of hidden subproject that will get us into trouble with the 
>> board/foundation.   I'm sure it's getting enough committer eyeballs, and 
>> some PMC oversight, but maybe not enough.  Addressing the more material 
>> points that Jon mentions is the best way to deal with that IMHO.
>> 
>> 
>> 
>> On Mon, 30 Sept 2024 at 20:37, Jon Haddad <[email protected]> wrote:
>>> I think it depends on what lens you're looking at the sidecar through.
>>> 
>>> If you're actively working on it, and pulling it into your own infra, sure. 
>>>  It's a thing. 
>>> 
>>> If you're an outsider?  I have a hard time seeing it.
>>> 
>>> - No documentation as to what it does
>>> - No releases
>>> - No build instructions
>>> - Trying to build using standard gradle commands fails [1]
>>> - Included configs don't work out of the box. [2][3]
>>> - CEP-1 looks abandoned
>>> - The primary reason right now to use it looks to be analytics library, 
>>> which doesn't work for most teams due to lack of vnode support [4]
>>> 
>>> I think if you were to take a poll of 100 users outside this ML, I'd bet 
>>> almost every one would agree the sidecar isn't a thing yet, and that's 
>>> probably more important than if it's actually getting worked on.  I think 
>>> it has quite a ways to go before it looks to be more than an idea that's 
>>> incubating.
>>> 
>>> [1] https://issues.apache.org/jira/browse/CASSANDRASC-120
>>> [2] https://issues.apache.org/jira/browse/CASSANDRASC-121
>>> [3] https://issues.apache.org/jira/browse/CASSANDRASC-122
>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-19594
>>> 
>>> 
>>> On Mon, Sep 30, 2024 at 11:14 AM Josh McKenzie <[email protected]> wrote:
>>>> The CEP for the sidecar has stalled. The sidecar itself is very much alive 
>>>> and a thing.
>>>> 
>>>> CEP != artifact.
>>>> 
>>>> We should definitely clean that up though.
>>>> 
>>>> On Mon, Sep 30, 2024, at 10:59 AM, Dinesh Joshi wrote:
>>>>> Patrick, could you please elaborate? The Sidecar has been a thing for a 
>>>>> while now.
>>>>> 
>>>>> On Mon, Sep 30, 2024 at 7:51 AM Patrick McFadin <[email protected]> 
>>>>> wrote:
>>>>>> I made the mistake of asking two things in one email. 
>>>>>> 
>>>>>> First thing I asked. Sidecar? Stalled CEP so why is this being talked 
>>>>>> about like this is a thing?
>>>>>> 
>>>>>> On Mon, Sep 30, 2024 at 7:21 AM Benedict <[email protected]> wrote:
>>>>>>> 
>>>>>>> Sorry Bernardo, you may have misunderstood me. I don’t have any 
>>>>>>> concerns, I was suggesting a possible future scenario where CDC for 
>>>>>>> Kafka via sidecar is changed to use a hypothetical future topic 
>>>>>>> subscription service provided by C*. It was meant to show that this CEP 
>>>>>>> may be easily decoupled from any future evolution in this area. 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 30 Sep 2024, at 14:58, Bernardo Botella 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> Thanks everyone for the comments.
>>>>>>>> 
>>>>>>>> Patrick:
>>>>>>>> The proposal includes a “best effort” approach for deduplication (some 
>>>>>>>> details can be found on the Digest class comments on the PR here 
>>>>>>>> https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193
>>>>>>>>  ). That alone won’t eliminate all the duplicates, but as Josh points 
>>>>>>>> out, it moves the line to something way easier to handle for 
>>>>>>>> consumers, and is definitely in the direction we should aim towards. 
>>>>>>>> Accord is definitely something this contribution will benefit from, 
>>>>>>>> as it will move that line even further.
>>>>>>>> 
>>>>>>>> Benedict:
>>>>>>>> If I understand it correctly, your concern is that Kafka is somewhat 
>>>>>>>> the hardcoded option for a CDC stream being published? The proposal 
>>>>>>>> introduces a concept of data sources and sinks 
>>>>>>>> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks)
>>>>>>>>  with Kafka being the first implemented data sink. That means that the 
>>>>>>>> actual Kafka output should (will) be something pluggable.
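
For illustration only, here is a minimal sketch of what a "pluggable sink" could look like. Every name in it (Sink, InMemorySink, SINKS, make_sink) is hypothetical and does not come from the CEP or the PR; it only shows the general pattern of choosing a sink implementation by name behind a common interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Sink(ABC):
    """Hypothetical destination for CDC events (not the CEP's interface)."""

    @abstractmethod
    def publish(self, event: bytes) -> None:
        """Deliver one serialized CDC event to the destination."""


class InMemorySink(Sink):
    """Stand-in sink that collects events in a list, e.g. for tests."""

    def __init__(self) -> None:
        self.events: List[bytes] = []

    def publish(self, event: bytes) -> None:
        self.events.append(event)


# Registry of available sinks; a Kafka-backed sink would be one more entry.
SINKS: Dict[str, Type[Sink]] = {"memory": InMemorySink}


def make_sink(name: str) -> Sink:
    """Instantiate the sink registered under `name`."""
    try:
        return SINKS[name]()
    except KeyError:
        raise ValueError(f"unknown sink: {name}") from None
```

With this shape, the code that produces CDC events only ever calls publish(), so swapping Kafka for another destination becomes a configuration change.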
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sep 30, 2024, at 5:43 AM, Josh McKenzie <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I don't see much on how this would be handled other than "left to 
>>>>>>>>>> the end user to figure out." 
>>>>>>>>> My immediate thought when I read that was "Yes. But it's moving where 
>>>>>>>>> we draw the line of 'left to the end user to figure out' *much 
>>>>>>>>> further* than it was before".
>>>>>>>>> 
>>>>>>>>> This should only be necessary in edge cases w/extended severe 
>>>>>>>>> degraded availability where you can't hit QUORUM w/this design. So we 
>>>>>>>>> go from "De-dupe literally everything o ye' user" to "de-dupe a small 
>>>>>>>>> fraction of a % of the time when things really go off the rails".
>>>>>>>>> 
>>>>>>>>> It still leaves the burden of processing potential duplicates 
>>>>>>>>> downstream, so some *complexity* burden on the users remains if they 
>>>>>>>>> have no tolerance for processing duplicate messages, however the 
>>>>>>>>> underlying machine resource utilization (from "dedupe everything" to 
>>>>>>>>> "dedupe a small % of things") is pretty massively shifted by this 
>>>>>>>>> design change. That, and using the hash of the mutation the way the 
>>>>>>>>> extended design does is something a downstream consumer could also do 
>>>>>>>>> on their side to ensure anything that came in past the drop-off 
>>>>>>>>> window wasn't already seen. So not *too* painful; certainly a vast 
>>>>>>>>> improvement over the status quo.
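
A minimal sketch of the consumer-side check described above, under assumptions: SHA-256 stands in for whatever digest the CEP's Digest class actually uses, the window is bounded by entry count rather than by time, and the names SeenWindow and dedupe are made up for this example.

```python
import hashlib
from collections import OrderedDict


class SeenWindow:
    """Remembers the digests of the most recent `capacity` mutations."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._seen: "OrderedDict[bytes, None]" = OrderedDict()

    def first_time(self, payload: bytes) -> bool:
        """Return True if this payload was not seen within the window."""
        # Assumption: SHA-256 of the serialized mutation is the dedupe key.
        digest = hashlib.sha256(payload).digest()
        if digest in self._seen:
            self._seen.move_to_end(digest)  # refresh recency of a repeat
            return False
        self._seen[digest] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest digest
        return True


def dedupe(messages, window):
    """Yield only the messages whose digest is new to the window."""
    for payload in messages:
        if window.first_time(payload):
            yield payload
```

Anything older than the window can still slip through as a duplicate, so the capacity has to cover at least the drop-off window mentioned above.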
>>>>>>>>> 
>>>>>>>>> As to TCM and Accord: absolutely agree. I'd love to see a world where 
>>>>>>>>> we Accord everything and fire off CDC to subscribers from a 
>>>>>>>>> coordinator bypassing all this LSM-bastardized post-processing for 
>>>>>>>>> CDC for instance. That said, this is a functionality users needed 
>>>>>>>>> back in... 2016? When we first added CDC. So I think it's worth it to 
>>>>>>>>> move on it now while retaining architectural options to move to 
>>>>>>>>> updated metadata and transactions as they mature (obviously we'll 
>>>>>>>>> lean on TCM since it's in 5.0 / trunk right now; more applies to the 
>>>>>>>>> accord bit).
>>>>>>>>> 
>>>>>>>>> On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote:
>>>>>>>>>> 
>>>>>>>>>> Yes, with accord it should be fairly easy to have reliable no-dupe 
>>>>>>>>>> log streaming without an elected leader. Given the broad set of use 
>>>>>>>>>> cases, I can imagine supporting some more native topic subscription 
>>>>>>>>>> API in C* rather than requiring Kafka, so perhaps any integration of 
>>>>>>>>>> Kafka with the sidecar can be considered a separate parallel effort, 
>>>>>>>>>> that might eventually implement itself with this C* feature whenever 
>>>>>>>>>> it materialises?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 30 Sep 2024, at 03:42, Jeff Jirsa <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Transactional metadata and Accord should make it MUCH easier to do 
>>>>>>>>>>> duplication avoiding CDC (and I was going to note that someone 
>>>>>>>>>>> should ensure that the interfaces exposed to the public are stable 
>>>>>>>>>>> enough not to change the published api once those exist)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 29, 2024, at 7:04 PM, Patrick McFadin <[email protected]> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> As I was reviewing this, it occurred to me that it was talking 
>>>>>>>>>>>> about Sidecar like it was a thing but that CEP has been stalled 
>>>>>>>>>>>> for quite some time:  
>>>>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224
>>>>>>>>>>>> 
>>>>>>>>>>>> If work on this is being done, should we get this official and 
>>>>>>>>>>>> wrapped up?
>>>>>>>>>>>> 
>>>>>>>>>>>> On to the proposal...
>>>>>>>>>>>> 
>>>>>>>>>>>> This has been a topic on the project for over 10 years now. I've 
>>>>>>>>>>>> seen multiple goes at making this work and the issue that always 
>>>>>>>>>>>> turns out to torpedo the project is handling dupes. To the point 
>>>>>>>>>>>> that they go from a generalized Kafka producer engine to something 
>>>>>>>>>>>> specific to a particular use case. I don't see much on how this 
>>>>>>>>>>>> would be handled other than "left to the end user to figure out." 
>>>>>>>>>>>> 
>>>>>>>>>>>> There is also little mention of where the increased resource load 
>>>>>>>>>>>> would be handled. 
>>>>>>>>>>>> 
>>>>>>>>>>>> This has been discussed many times before, but is it time to 
>>>>>>>>>>>> introduce the concept of an elected leader for a token range for 
>>>>>>>>>>>> this type of operation? It would eliminate a ton of problems that 
>>>>>>>>>>>> need to be managed when bridging C* to a system like Kafka. Last time 
>>>>>>>>>>>> it was discussed in earnest was for KIP-30: 
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> Patrick
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad 
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>> Yes! I’m really looking forward to trying this out. The CEP looks 
>>>>>>>>>>>>> really well thought out. I think this will make CDC a lot more 
>>>>>>>>>>>>> useful for a lot of teams. 
>>>>>>>>>>>>> Jon
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie 
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>> Really excited to see this hit the ML James.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As author of the base CDC (get your stones ready for throwing 
>>>>>>>>>>>>>> :D) and someone moderately involved in the CEP here, definitely 
>>>>>>>>>>>>>> welcome any questions. CDC is a *thorny* *problem* in a 
>>>>>>>>>>>>>> multi-replica distributed system like this.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Wiki: 
>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We would like to propose this CEP for adoption by the community.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> CDC is a common technique in databases but right now there is 
>>>>>>>>>>>>>>> no out-of-the-box solution to do this easily and at scale with 
>>>>>>>>>>>>>>> Cassandra. Our proposal is to build a fully-fledged solution 
>>>>>>>>>>>>>>> into the Apache Cassandra Sidecar. This comes with a number of 
>>>>>>>>>>>>>>> benefits:
>>>>>>>>>>>>>>> - Sidecar is an official part of the existing Cassandra 
>>>>>>>>>>>>>>> eco-system.
>>>>>>>>>>>>>>> - Sidecar runs co-located with Cassandra instances and so 
>>>>>>>>>>>>>>> scales with the cluster size.
>>>>>>>>>>>>>>> - Sidecar can access the underlying Cassandra database to store 
>>>>>>>>>>>>>>> CDC configuration and the CDC state in a special table.
>>>>>>>>>>>>>>> - Running in the Sidecar does not require additional external 
>>>>>>>>>>>>>>> resources to run.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We anticipate the core CDC module will be pluggable and 
>>>>>>>>>>>>>>> re-usable; it is available for review here: 
>>>>>>>>>>>>>>> https://github.com/apache/cassandra-analytics/pull/87. The 
>>>>>>>>>>>>>>> remaining Sidecar code will follow.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As a reminder, please keep the discussion here on the dev list 
>>>>>>>>>>>>>>> vs. in the wiki, as we’ve found it easier to manage via email.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sincerely,
>>>>>>>>>>>>>>> James Berragan
>>>>>>>>>>>>>>> Bernardo Botella Corbi
>>>>>>>>>>>>>>> Yifan Cai
>>>>>>>>>>>>>>> Jyothsna Konisa
>>>>
