Hi everyone, I’d like to make one final request for feedback on CEP-44: Kafka integration for Cassandra CDC using Sidecar before calling a vote. We’ll leave the thread open for a few more days, and if nothing else comes in, we will call a vote.
Bernardo > On Oct 1, 2024, at 6:58 AM, James Berragan <jberra...@gmail.com> wrote: > > It seems this has triggered some important discussions about CEP-1 and the > Sidecar. Let's keep those in their respective threads and focus this > conversation on CEP-44. > > Patrick, I think I missed your point "There is also little mention of where > the increased resource load would be handled." - you're right, running CDC in > the Sidecar implicitly means it uses additional resources in the C* cluster. > This resource usage is proportional to the write throughput, so it's not > suitable for use cases with very high write throughput, but our experience > has been that for standard mixed workloads the overhead is minimal. The > throttling built in safely handles burst workloads. > > James. > > On Mon, 30 Sept 2024 at 14:22, Josh McKenzie <jmcken...@apache.org > <mailto:jmcken...@apache.org>> wrote: >>> This is the type of hidden subproject that will get us into trouble with >>> the board/foundation. I'm sure it's getting enough committer eyeballs, >>> and some PMC oversight, but maybe not enough. >> I don't agree with the qualifier of it as being hidden. It's definitely >> lower traffic than the main project but there's movement on the JIRA here: >> link >> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRASC%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC>. >> >> I assume the sidecar is going to take longer to reach a tipping point where >> more people start contributing to it until it has compelling features >> that'll incentivize folks running their own bespoke sidecars to migrate over. >> >> Agree with all your points Jon; there's a lot to be done on it. >> >> CEP-1 is pretty much abandoned yeah. I think it'd be reasonable to close it >> down and open up a new one w/active contributors + active shepherd and a >> much more limited scope. 
>> >> On Mon, Sep 30, 2024, at 2:13 PM, Patrick McFadin wrote: >>> I'm mentioning it because I was surprised and I feel like I generally have >>> a finger on the pulse of the project. >>> >>> I would love to talk about it more and get more community support and >>> interest. >>> >>> On Mon, Sep 30, 2024 at 11:01 AM Mick Semb Wever <m...@apache.org >>> <mailto:m...@apache.org>> wrote: >>> Agree with Jon, Josh and Patrick here. >>> >>> This is the type of hidden subproject that will get us into trouble with >>> the board/foundation. I'm sure it's getting enough committer eyeballs, >>> and some PMC oversight, but maybe not enough. Addressing the more material >>> points that Jon mentions is the best way to deal with that IMHO. >>> >>> >>> >>> On Mon, 30 Sept 2024 at 20:37, Jon Haddad <j...@rustyrazorblade.com >>> <mailto:j...@rustyrazorblade.com>> wrote: >>> I think it depends on what lens you're looking at the sidecar through. >>> >>> If you're actively working on it, and pulling it into your own infra, sure. >>> It's a thing. >>> >>> If you're an outsider? I have a hard time seeing it. >>> >>> - No documentation as to what it does >>> - No releases >>> - No build instructions >>> - Trying to build using standard gradle commands fails [1] >>> - Included configs don't work out of the box. [2][3] >>> - CEP-1 looks abandoned >>> - The primary reason right now to use it looks to be analytics library, >>> which doesn't work for most teams due to lack of vnode support [4] >>> >>> I think if you were to take a poll of 100 users outside this ML, I'd bet >>> almost every one would agree the sidecar isn't a thing yet, and that's >>> probably more important than if it's actually getting worked on. I think >>> it has quite a ways to go before it looks to be more than an idea that's >>> incubating. 
>>> >>> [1] https://issues.apache.org/jira/browse/CASSANDRASC-120 >>> [2] https://issues.apache.org/jira/browse/CASSANDRASC-121 >>> [3] https://issues.apache.org/jira/browse/CASSANDRASC-122 >>> [4] https://issues.apache.org/jira/browse/CASSANDRA-19594 >>> >>> >>> On Mon, Sep 30, 2024 at 11:14 AM Josh McKenzie <jmcken...@apache.org >>> <mailto:jmcken...@apache.org>> wrote: >>> >>> The CEP for the sidecar has stalled. The sidecar itself is very much alive >>> and a thing. >>> >>> CEP != artifact. >>> >>> We should definitely clean that up though. >>> >>> On Mon, Sep 30, 2024, at 10:59 AM, Dinesh Joshi wrote: >>>> Patrick, could you please elaborate? The Sidecar has been a thing for a >>>> while now. >>>> >>>> On Mon, Sep 30, 2024 at 7:51 AM Patrick McFadin <pmcfa...@gmail.com >>>> <mailto:pmcfa...@gmail.com>> wrote: >>>> I made the mistake of asking two things in one email. >>>> >>>> First thing I asked. Sidecar? Stalled CEP so why is this being talked >>>> about like this is a thing? >>>> >>>> On Mon, Sep 30, 2024 at 7:21 AM Benedict <bened...@apache.org >>>> <mailto:bened...@apache.org>> wrote: >>>> >>>> Sorry Bernardo, you may have misunderstood me. I don’t have any concerns, >>>> I was suggesting a possible future scenario where CDC for Kafka via >>>> sidecar is changed to use a hypothetical future topic subscription service >>>> provided by C*. It was meant to show that this CEP may be easily decoupled >>>> from any future evolution in this area. >>>> >>>> >>>>> On 30 Sep 2024, at 14:58, Bernardo Botella <conta...@bernardobotella.com >>>>> <mailto:conta...@bernardobotella.com>> wrote: >>>>> Thanks everyone for the comments. >>>> >>>>> >>>>> Patrick: >>>>> The proposal includes a “best effort” approach for deduplication (some >>>>> details can be found on the Digest class comments on the PR here >>>>> https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193 >>>>> ). 
That alone won’t eliminate all the duplicates, but as Josh points >>>>> out, it moves the line to something way easier to handle for consumers, >>>>> and definitely in the direction we should aim towards. Accord is >>>>> definitely something this contribution will benefit from, that will move >>>>> that line even further. >>>>> >>>>> Benedict: >>>>> If I understand it correctly, your concern is that Kafka is somewhat the >>>>> hardcoded option for a CDC stream being published? The proposal >>>>> introduces a concept of data sources and sinks >>>>> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks) >>>>> with Kafka being the first implemented data sink. That means that the actual >>>>> Kafka output should (will) be something pluggable. >>>>> >>>>> >>>>> >>>>>> On Sep 30, 2024, at 5:43 AM, Josh McKenzie <jmcken...@apache.org >>>>>> <mailto:jmcken...@apache.org>> wrote: >>>>>> >>>>>>> I don't see much on how this would be handled other than "left to the >>>>>>> end user to figure out." >>>>>> My immediate thought when I read that was "Yes. But it's moving where we >>>>>> draw the line of 'left to the end user to figure out' much further than >>>>>> it was before". >>>>>> >>>>>> This should only be necessary in edge cases w/extended severe degraded >>>>>> availability where you can't hit QUORUM w/this design. So we go from >>>>>> "De-dupe literally everything o ye' user" to "de-dupe a small fraction >>>>>> of a % of the time when things really go off the rails". >>>>>> >>>>>> It still leaves the burden of processing potential duplicates >>>>>> downstream, so some complexity burden on the users remains if they have >>>>>> no tolerance for processing duplicate messages, however the underlying >>>>>> machine resource utilization (from "dedupe everything" to "dedupe a >>>>>> small % of things") is pretty massively shifted by this design change. 
>>>>>> That, and using the hash of the mutation the way the extended design >>>>>> does is something a downstream consumer could also do on their side to >>>>>> ensure anything that came in past the drop-off window wasn't already >>>>>> seen. So not too painful; certainly a vast improvement over the status >>>>>> quo. >>>>>> >>>>>> As to TCM and Accord: absolutely agree. I'd love to see a world where we >>>>>> Accord everything and fire off CDC to subscribers from a coordinator >>>>>> bypassing all this LSM-bastardized post-processing for CDC for instance. >>>>>> That said, this is a functionality users needed back in... 2016? When we >>>>>> first added CDC. So I think it's worth it to move on it now while >>>>>> retaining architectural options to move to updated metadata and >>>>>> transactions as they mature (obviously we'll lean on TCM since it's in >>>>>> 5.0 / trunk right now; more applies to the accord bit). >>>>>> >>>>>> On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote: >>>>>>> >>>>>>> Yes, with accord it should be fairly easy to have reliable no-dupe log >>>>>>> streaming without an elected leader. Given the broad set of use cases, >>>>>>> I can imagine supporting some more native topic subscription API in C* >>>>>>> rather than requiring Kafka, so perhaps any integration of Kafka with >>>>>>> the sidecar can be considered a separate parallel effort, that might >>>>>>> eventually implement itself with this C* feature whenever it >>>>>>> materialises? 
>>>>>>> >>>>>>> >>>>>>>> On 30 Sep 2024, at 03:42, Jeff Jirsa <jji...@gmail.com >>>>>>>> <mailto:jji...@gmail.com>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Transactional metadata and Accord should make it MUCH easier to do >>>>>>>> duplication avoiding CDC (and I was going to note that someone should >>>>>>>> ensure that the interfaces exposed to the public are stable enough not >>>>>>>> to change the published api once those exist) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On Sep 29, 2024, at 7:04 PM, Patrick McFadin <pmcfa...@gmail.com >>>>>>>>> <mailto:pmcfa...@gmail.com>> wrote: >>>>>>>>> >>>>>>>>> As I was reviewing this, it occurred to me that it was talking about >>>>>>>>> Sidecar like it was a thing but that CEP has been stalled for quite >>>>>>>>> some time: >>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224 >>>>>>>>> >>>>>>>>> If work on this is being done, should we get this official and >>>>>>>>> wrapped up? >>>>>>>>> >>>>>>>>> On to the proposal... >>>>>>>>> >>>>>>>>> This has been a topic on the project for over 10 years now. I've seen >>>>>>>>> multiple goes at making this work and the issue that always turns out >>>>>>>>> to torpedo the project is handling dupes. To the point that they go >>>>>>>>> from a generalized Kafka producer engine to something specific to a >>>>>>>>> particular use case. I don't see much on how this would be handled >>>>>>>>> other than "left to the end user to figure out." >>>>>>>>> >>>>>>>>> There is also little mention of where the increased resource load >>>>>>>>> would be handled. >>>>>>>>> >>>>>>>>> This has been discussed many times before, but is it time to >>>>>>>>> introduce the concept of an elected leader for a token range for this >>>>>>>>> type of operation? It would eliminate a ton of problems that need to be >>>>>>>>> managed when bridging c* to a system like Kafka. 
Last time it was >>>>>>>>> discussed in earnest was for KIP-30: >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems >>>>>>>>> >>>>>>>>> >>>>>>>>> Patrick >>>>>>>>> >>>>>>>>> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> <mailto:j...@rustyrazorblade.com>> wrote: >>>>>>>>> Yes! I’m really looking forward to trying this out. The CEP looks >>>>>>>>> really well thought out. I think this will make CDC a lot more useful >>>>>>>>> for a lot of teams. >>>>>>>>> Jon >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie <jmcken...@apache.org >>>>>>>>> <mailto:jmcken...@apache.org>> wrote: >>>>>>>>> >>>>>>>>> Really excited to see this hit the ML James. >>>>>>>>> >>>>>>>>> As author of the base CDC (get your stones ready for throwing :D) and >>>>>>>>> someone moderately involved in the CEP here, definitely welcome any >>>>>>>>> questions. CDC is a thorny problem in a multi-replica distributed >>>>>>>>> system like this. >>>>>>>>> >>>>>>>>> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote: >>>>>>>>>> Hi everyone, >>>>>>>>>> >>>>>>>>>> Wiki: >>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar >>>>>>>>>> >>>>>>>>>> We would like to propose this CEP for adoption by the community. >>>>>>>>>> >>>>>>>>>> CDC is a common technique in databases but right now there is no >>>>>>>>>> out-of-the-box solution to do this easily and at scale with >>>>>>>>>> Cassandra. Our proposal is to build a fully-fledged solution into >>>>>>>>>> the Apache Cassandra Sidecar. This comes with a number of benefits: >>>>>>>>>> - Sidecar is an official part of the existing Cassandra eco-system. >>>>>>>>>> - Sidecar runs co-located with Cassandra instances and so scales >>>>>>>>>> with the cluster size. 
>>>>>>>>>> - Sidecar can access the underlying Cassandra database to store CDC >>>>>>>>>> configuration and the CDC state in a special table. >>>>>>>>>> - Running in the Sidecar does not require additional external >>>>>>>>>> resources to run. >>>>>>>>>> >>>>>>>>>> The core CDC module we anticipate will be pluggable and re-usable, >>>>>>>>>> it is available for review here: >>>>>>>>>> https://github.com/apache/cassandra-analytics/pull/87. The remaining >>>>>>>>>> Sidecar code will follow. >>>>>>>>>> >>>>>>>>>> As a reminder, please keep the discussion here on the dev list vs. >>>>>>>>>> in the wiki, as we’ve found it easier to manage via email. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> James Berragan >>>>>>>>>> Bernardo Botella Corbi >>>>>>>>>> Yifan Cai >>>>>>>>>> Jyothsna Konisa >>> >>
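
Editor's note on the de-duplication point raised earlier in the thread: Josh suggests that a downstream consumer could hash each mutation and check it against recently seen hashes, so that anything arriving past the drop-off window is not processed twice. The sketch below illustrates that idea only. It is not the Digest class from the linked PR; the `SeenWindow` name, the choice of SHA-256, and the assumption that the payload bytes already include whatever makes a mutation unique (for example its write timestamp) are all hypothetical.

```python
import hashlib
import time
from collections import OrderedDict


class SeenWindow:
    """Remember digests of recently seen mutations (hypothetical helper).

    Assumes `payload` is the serialized mutation, including any field that
    distinguishes two intentional writes of the same values.
    """

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        # digest -> time first seen; insertion order is oldest first
        self._seen = OrderedDict()

    def _evict(self, now):
        # Drop digests that have aged out of the window.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, payload: bytes) -> bool:
        """Return True if this payload was already seen inside the window."""
        now = self.clock()
        self._evict(now)
        digest = hashlib.sha256(payload).digest()
        if digest in self._seen:
            return True
        self._seen[digest] = now
        return False
```

A consumer would call `is_duplicate` on each message before processing it and skip the message when it returns True. Memory use grows with write throughput multiplied by the window length, so the window size is a trade-off the consumer has to pick for its own workload.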