Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Benedict Mon, 30 Sep 2024 07:21:08 -0700

Sorry Bernardo, you may have misunderstood me. I don’t have any concerns, I was suggesting a possible future scenario where CDC for Kafka via sidecar is changed to use a hypothetical future topic subscription service provided by C*. It was meant to show that this CEP may be easily decoupled from any future evolution in this area.

On 30 Sep 2024, at 14:58, Bernardo Botella <[email protected]> wrote:

Thanks everyone for the comments.

Patrick:
The proposal includes a “best effort” approach to deduplication (some details can be found in the Digest class comments on the PR here https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193 ). That alone won’t eliminate all the duplicates, but as Josh points out, it moves the line to something far easier for consumers to handle, and it is definitely in the direction we should aim for. Accord is definitely something this contribution will benefit from, and it will move that line even further.

Benedict:
If I understand it correctly, your concern is that Kafka is somewhat the hardcoded option for a CDC stream being published? The proposal introduces a concept of data sources and sinks (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks), with Kafka being the first implemented data sink. That means that the actual Kafka output should (will) be something pluggable.
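
To illustrate what a pluggable sink could look like, here is a minimal sketch. The names `ChangeSink`, `InMemorySink`, and `CdcPublisher` are hypothetical and are not taken from the CEP or the PR; the sketch only shows the general shape of an interface that would let Kafka be one implementation among several.

```python
from abc import ABC, abstractmethod


class ChangeSink(ABC):
    """Hypothetical sink interface: anything that can accept a change event."""

    @abstractmethod
    def publish(self, key: bytes, value: bytes) -> None:
        ...


class InMemorySink(ChangeSink):
    """Stand-in sink for illustration; collects events in a list."""

    def __init__(self):
        self.events = []

    def publish(self, key: bytes, value: bytes) -> None:
        self.events.append((key, value))


class CdcPublisher:
    """Hypothetical publisher that depends only on the sink interface."""

    def __init__(self, sink: ChangeSink):
        self.sink = sink

    def on_mutation(self, partition_key: bytes, serialized: bytes) -> None:
        # The publisher never names Kafka; swapping the sink swaps the output.
        self.sink.publish(partition_key, serialized)


sink = InMemorySink()
CdcPublisher(sink).on_mutation(b"pk-1", b"serialized-mutation")
```

Under this assumption, a Kafka sink would be another `ChangeSink` subclass whose `publish` hands the key and value to a real producer client.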

On Sep 30, 2024, at 5:43 AM, Josh McKenzie <[email protected]> wrote:

I don't see much on how this would be handled other than "left to the end user to figure out."
My immediate thought when I read that was "Yes. But it's moving where we draw the line of 'left to the end user to figure out' much further than it was before".

This should only be necessary in edge cases w/extended severe degraded availability where you can't hit QUORUM w/this design. So we go from "De-dupe literally everything o ye' user" to "de-dupe a small fraction of a % of the time when things really go off the rails".
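
One way to read the QUORUM remark above, offered as an assumption and not as a description of the PR: a change is emitted once matching copies have been observed from a quorum of replicas, and it is emitted only once. The `QuorumTracker` class and its method names below are hypothetical.

```python
from collections import defaultdict


class QuorumTracker:
    """Hypothetical tracker: emit a mutation digest once a quorum has reported it."""

    def __init__(self, replication_factor: int):
        self.quorum = replication_factor // 2 + 1
        self._replicas = defaultdict(set)  # digest -> replicas that reported it
        # Unbounded here for brevity; a real system would expire old digests,
        # and that expiry window is where duplicates could still slip through.
        self._emitted = set()

    def observe(self, digest: bytes, replica: str) -> bool:
        """Return True exactly once, when `digest` first reaches quorum."""
        if digest in self._emitted:
            return False
        seen = self._replicas[digest]
        seen.add(replica)
        if len(seen) >= self.quorum:
            self._emitted.add(digest)
            del self._replicas[digest]
            return True
        return False


tracker = QuorumTracker(replication_factor=3)
tracker.observe(b"digest-1", "replica-a")  # one replica, below quorum
tracker.observe(b"digest-1", "replica-b")  # second replica reaches quorum
```

In this sketch a mutation that never gathers a quorum of reports is never emitted by the tracker, which is the degraded-availability edge case where some other fallback, and possibly downstream de-duplication, would be needed.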

It still leaves the burden of processing potential duplicates downstream, so some complexity burden on the users remains if they have no tolerance for processing duplicate messages, however the underlying machine resource utilization (from "dedupe everything" to "dedupe a small % of things") is pretty massively shifted by this design change. That, and using the hash of the mutation the way the extended design does is something a downstream consumer could also do on their side to ensure anything that came in past the drop-off window wasn't already seen. So not too painful; certainly a vast improvement over the status quo.
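
A downstream consumer could apply the same hash idea along these lines. This is a minimal sketch under stated assumptions: the `SeenWindow` class is hypothetical, SHA-256 is an arbitrary choice and not necessarily the digest the PR uses, and the window length is a placeholder.

```python
import hashlib
import time
from collections import OrderedDict


class SeenWindow:
    """Remembers payload digests for `ttl_seconds` and reports repeats."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = OrderedDict()  # digest -> time first seen, oldest first

    def _expire(self, now: float) -> None:
        # Insertion order equals age order, so stop at the first fresh entry.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.ttl:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, payload: bytes) -> bool:
        now = self.clock()
        self._expire(now)
        digest = hashlib.sha256(payload).digest()
        if digest in self._seen:
            return True
        self._seen[digest] = now
        return False


window = SeenWindow(ttl_seconds=300)
for message in (b"mutation-1", b"mutation-2", b"mutation-1"):
    if not window.is_duplicate(message):
        pass  # process the message; the repeated b"mutation-1" is skipped
```

Memory use is bounded by the number of distinct messages seen within the window, so the window length trades memory against how late a duplicate can arrive and still be caught.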

As to TCM and Accord: absolutely agree. I'd love to see a world where we Accord everything and fire off CDC to subscribers from a coordinator bypassing all this LSM-bastardized post-processing for CDC for instance. That said, this is a functionality users needed back in... 2016? When we first added CDC. So I think it's worth it to move on it now while retaining architectural options to move to updated metadata and transactions as they mature (obviously we'll lean on TCM since it's in 5.0 / trunk right now; more applies to the accord bit).

On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote:

Yes, with accord it should be fairly easy to have reliable no-dupe log streaming without an elected leader. Given the broad set of use cases, I can imagine supporting some more native topic subscription API in C* rather than requiring Kafka, so perhaps any integration of Kafka with the sidecar can be considered a separate parallel effort, that might eventually implement itself with this C* feature whenever it materialises?

On 30 Sep 2024, at 03:42, Jeff Jirsa <[email protected]> wrote:

Transactional metadata and Accord should make it MUCH easier to do duplication avoiding CDC (and I was going to note that someone should ensure that the interfaces exposed to the public are stable enough not to change the published api once those exist)

On Sep 29, 2024, at 7:04 PM, Patrick McFadin <[email protected]> wrote:

As I was reviewing this, it occurred to me that it was talking about Sidecar like it was a thing but that CEP has been stalled for quite some time: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224

If work on this is being done, should we get this official and wrapped up?

On to the proposal...

This has been a topic on the project for over 10 years now. I've seen multiple goes at making this work and the issue that always turns out to torpedo the project is handling dupes. To the point that they go from a generalized Kafka producer engine to something specific to a particular use case. I don't see much on how this would be handled other than "left to the end user to figure out."

There is also little mention of where the increased resource load would be handled.

This has been discussed many times before, but is it time to introduce the concept of an elected leader for a token range for this type of operation? It would eliminate a ton of problems that need to be managed when bridging C* to a system like Kafka. Last time it was discussed in earnest was for KIP-30: https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems

Patrick

On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad <[email protected]> wrote:
Yes! I’m really looking forward to trying this out. The CEP looks really well thought out. I think this will make CDC a lot more useful for a lot of teams.
Jon

On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie <[email protected]> wrote:

Really excited to see this hit the ML, James.

As author of the base CDC (get your stones ready for throwing :D) and someone moderately involved in the CEP here, definitely welcome any questions. CDC is a thorny problem in a multi-replica distributed system like this.

On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
Hi everyone,

Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar

We would like to propose this CEP for adoption by the community.

CDC is a common technique in databases but right now there is no out-of-the-box solution to do this easily and at scale with Cassandra. Our proposal is to build a fully-fledged solution into the Apache Cassandra Sidecar. This comes with a number of benefits:
- Sidecar is an official part of the existing Cassandra eco-system.
- Sidecar runs co-located with Cassandra instances and so scales with the cluster size.
- Sidecar can access the underlying Cassandra database to store CDC configuration and the CDC state in a special table.
- Running in the Sidecar does not require additional external resources to run.

We anticipate the core CDC module will be pluggable and re-usable; it is available for review here: https://github.com/apache/cassandra-analytics/pull/87. The remaining Sidecar code will follow.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, as we’ve found it easier to manage via email.

Sincerely,
James Berragan
Bernardo Botella Corbi
Yifan Cai
Jyothsna Konisa
