Re: [DISCUSS] CEP-44: Kafka integration for Cassandra CDC using Sidecar

Dinesh Joshi Mon, 30 Sep 2024 08:01:39 -0700

Patrick, could you please elaborate? The Sidecar has been a thing for a
while now.


On Mon, Sep 30, 2024 at 7:51 AM Patrick McFadin <[email protected]> wrote:

> I made the mistake of asking two things in one email.
>
> First thing I asked. Sidecar? Stalled CEP so why is this being talked
> about like this is a thing?
>
> On Mon, Sep 30, 2024 at 7:21 AM Benedict <[email protected]> wrote:
>
>> Sorry Bernardo, you may have misunderstood me. I don’t have any concerns,
>> I was suggesting a possible future scenario where CDC for Kafka via sidecar
>> is changed to use a hypothetical future topic subscription service provided
>> by C*. It was meant to show that this CEP may be easily decoupled from any
>> future evolution in this area.
>>
>> On 30 Sep 2024, at 14:58, Bernardo Botella <[email protected]>
>> wrote:
>>
>> Thanks everyone for the comments.
>>
>>
>> Patrick:
>> The proposal includes a “best effort” approach for deduplication (some
>> details can be found in the Digest class comments on the PR here
>> https://github.com/apache/cassandra-analytics/pull/87/files#diff-3a09caecc1da13419d92cde56a7cfc7d253faac08182e6c2768b3d32c015de82R185-R193
>>  ).
>> That alone won’t eliminate all the duplicates, but as Josh points out, it
>> moves the line to something way easier to handle for consumers, and
>> definitely in the direction we should aim for. Accord is definitely
>> something this contribution will benefit from, and it will move that line
>> even further.
>>
>> Benedict:
>> If I understand it correctly, your concern is that Kafka is somewhat the
>> hardcoded option for a CDC stream being published? The proposal introduces
>> a concept of data sources and sinks (
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=323488575#CEP44:KafkaintegrationforCassandraCDCusingSidecar-SourcesandSinks)
>> with Kafka being the first implemented data sink. That means that the actual
>> Kafka output should (will) be something pluggable.
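
[Editorial illustration, not part of the original message.] The pluggable
sink idea described above could look roughly like the sketch below. Every name
here (ChangeEventSink, InMemorySink, forward) is hypothetical and is not taken
from the CEP or the PR; a real Kafka sink would wrap a producer client behind
the same interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class ChangeEventSink(ABC):
    """Hypothetical sink interface: anything that can accept a CDC event."""

    @abstractmethod
    def publish(self, event: dict) -> None:
        """Deliver one change event to the downstream system."""


class InMemorySink(ChangeEventSink):
    """Stand-in sink used for illustration; it just collects events in a list."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def publish(self, event: dict) -> None:
        self.events.append(event)


def forward(events: Iterable[dict], sink: ChangeEventSink) -> int:
    """Push every event to the given sink and return how many were sent."""
    count = 0
    for event in events:
        sink.publish(event)
        count += 1
    return count
```

Because forward() depends only on the interface, swapping Kafka for another
destination would mean supplying a different ChangeEventSink implementation.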
>>
>>
>>
>> On Sep 30, 2024, at 5:43 AM, Josh McKenzie <[email protected]> wrote:
>>
>> I don't see much on how this would be handled other than "left to the end
>> user to figure out."
>>
>> My immediate thought when I read that was "Yes. But it's moving where we
>> draw the line of 'left to the end user to figure out' *much further* than
>> it was before".
>>
>> This should only be necessary in edge cases w/extended severe degraded
>> availability where you can't hit QUORUM w/this design. So we go from
>> "De-dupe literally everything o ye' user" to "de-dupe a small fraction of a
>> % of the time when things really go off the rails".
>>
>> It still leaves the burden of processing potential duplicates downstream,
>> so some *complexity* burden on the users remains if they have no
>> tolerance for processing duplicate messages, however the underlying machine
>> resource utilization (from "dedupe everything" to "dedupe a small % of
>> things") is pretty massively shifted by this design change. That, and using
>> the hash of the mutation the way the extended design does is something a
>> downstream consumer could also do on their side to ensure anything that
>> came in past the drop-off window wasn't already seen. So not *too* painful;
>> certainly a vast improvement over the status quo.
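
[Editorial illustration, not part of the original message.] A minimal sketch
of the consumer-side check mentioned above: hash each incoming mutation and
drop it if the same hash was seen within a retention window. The class name,
the use of SHA-256, and the explicit `now` argument are assumptions made for
the example; the CEP's own digest scheme may differ. Timestamps are assumed to
be non-decreasing.

```python
import hashlib
from collections import OrderedDict


class WindowedDeduper:
    """Rejects payloads whose digest was already seen within `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # digest -> time first seen; insertion order is oldest first
        self._seen: "OrderedDict[str, float]" = OrderedDict()

    @staticmethod
    def digest(payload: bytes) -> str:
        # Assumption: SHA-256 over the serialized mutation bytes.
        return hashlib.sha256(payload).hexdigest()

    def _expire(self, now: float) -> None:
        # Drop entries that have aged out of the retention window.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def is_new(self, payload: bytes, now: float) -> bool:
        """Return True and remember the payload if it was not seen recently."""
        self._expire(now)
        key = self.digest(payload)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

A consumer would call is_new() on each message and skip processing when it
returns False. Memory grows with the number of distinct mutations per window,
so the window length is the main tuning knob.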
>>
>> As to TCM and Accord: absolutely agree. I'd love to see a world where we
>> Accord everything and fire off CDC to subscribers from a coordinator
>> bypassing all this LSM-bastardized post-processing for CDC for instance.
>> That said, this is a functionality users needed back in... 2016? When we
>> first added CDC. So I think it's worth it to move on it now while retaining
>> architectural options to move to updated metadata and transactions as they
>> mature (obviously we'll lean on TCM since it's in 5.0 / trunk right now;
>> more applies to the accord bit).
>>
>> On Mon, Sep 30, 2024, at 3:20 AM, Benedict wrote:
>>
>>
>> Yes, with accord it should be fairly easy to have reliable no-dupe log
>> streaming without an elected leader. Given the broad set of use cases, I
>> can imagine supporting some more native topic subscription API in C* rather
>> than requiring Kafka, so perhaps any integration of Kafka with the sidecar
>> can be considered a separate parallel effort, that might eventually
>> implement itself with this C* feature whenever it materialises?
>>
>>
>> On 30 Sep 2024, at 03:42, Jeff Jirsa <[email protected]> wrote:
>>
>>
>>
>> Transactional metadata and Accord should make it MUCH easier to do
>> duplication avoiding CDC (and I was going to note that someone should
>> ensure that the interfaces exposed to the public are stable enough not to
>> change the published api once those exist)
>>
>>
>>
>> On Sep 29, 2024, at 7:04 PM, Patrick McFadin <[email protected]> wrote:
>>
>>
>> As I was reviewing this, it occurred to me that it was talking about
>> Sidecar like it was a thing but that CEP has been stalled for quite some
>> time:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224
>>
>> If work on this is being done, should we get this official and wrapped up?
>>
>> On to the proposal...
>>
>> This has been a topic on the project for over 10 years now. I've seen
>> multiple goes at making this work and the issue that always turns out to
>> torpedo the project is handling dupes. To the point that they go from a
>> generalized Kafka producer engine to something specific to a particular use
>> case. I don't see much on how this would be handled other than "left to the
>> end user to figure out."
>>
>> There is also little mention of where the increased resource load would
>> be handled.
>>
>> This has been discussed many times before, but is it time to introduce
>> the concept of an elected leader for a token range for this type of
>> operation? It would eliminate a ton of problems that need to be managed when
>> bridging C* to a system like Kafka. Last time it was discussed in earnest
>> was for KIP-30:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>>
>>
>> Patrick
>>
>> On Sat, Sep 28, 2024 at 11:44 AM Jon Haddad <[email protected]>
>> wrote:
>>
>> Yes! I’m really looking forward to trying this out. The CEP looks really
>> well thought out. I think this will make CDC a lot more useful for a lot of
>> teams.
>> Jon
>>
>>
>> On Fri, Sep 27, 2024 at 4:23 PM Josh McKenzie <[email protected]>
>> wrote:
>>
>>
>> Really excited to see this hit the ML, James.
>>
>> As author of the base CDC (get your stones ready for throwing :D) and
>> someone moderately involved in the CEP here, definitely welcome any
>> questions. CDC is a *thorny problem* in a multi-replica distributed
>> system like this.
>>
>> On Fri, Sep 27, 2024, at 5:40 PM, James Berragan wrote:
>>
>> Hi everyone,
>>
>> Wiki:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar
>>
>> We would like to propose this CEP for adoption by the community.
>>
>> CDC is a common technique in databases but right now there is no
>> out-of-the-box solution to do this easily and at scale with Cassandra. Our
>> proposal is to build a fully-fledged solution into the Apache Cassandra
>> Sidecar. This comes with a number of benefits:
>> - Sidecar is an official part of the existing Cassandra ecosystem.
>> - Sidecar runs co-located with Cassandra instances and so scales with the
>> cluster size.
>> - Sidecar can access the underlying Cassandra database to store CDC
>> configuration and the CDC state in a special table.
>> - Running in the Sidecar does not require additional external resources
>> to run.
>>
>> We anticipate the core CDC module will be pluggable and reusable; it is
>> available for review here:
>> https://github.com/apache/cassandra-analytics/pull/87. The remaining
>> Sidecar code will follow.
>>
>> As a reminder, please keep the discussion here on the dev list vs. in the
>> wiki, as we’ve found it easier to manage via email.
>>
>> Sincerely,
>> James Berragan
>> Bernardo Botella Corbi
>> Yifan Cai
>> Jyothsna Konisa
>>
>>
>>
