Hi Stefan,

CDC is something we are also thinking about, and it is worthy of a separate discussion. We have tested Spark Streaming for CDC and I hope we can bolt it on in the future, but streaming technologies also come with more caveats and nuances (we have found it beneficial with CDC to store a small amount of state, which is at odds with Spark’s more stateless architecture). From that perspective I think it makes sense to keep CDC technology-agnostic and let the user plug in to whichever system they want (Spark Streaming, Flink, custom, etc.).
James.

> On Apr 11, 2023, at 1:19 PM, Miklosovic, Stefan
> <stefan.mikloso...@netapp.com> wrote:
>
> Doug,
>
> thanks for the diagrams, really helpful.
>
> Do you think there might be some extension to this CEP (does not need to be
> necessarily included from the very beginning, just food for thought at this
> point) which would read data from the commit log / CDC?
>
> The main motivation behind this is that when one looks around in terms of
> what is currently possible with Spark, Cassandra often exists as a sink only
> when it comes to streaming. For example, take Spark. We can use the Kafka
> connector (1) so data would come to Kafka, it would be streamed to Spark as
> RDDs and Spark would save it to Cassandra via the Spark Cassandra Connector.
> Such a transformation / pipeline is indeed possible.
>
> We also have the Cassandra + Ignite integration (2, 3) so Ignite can act as
> an in-memory caching layer on top of Cassandra, which enables users to do
> transformations over IgniteRDD and queries which are not possible normally
> (e.g. joins in SQL in Ignite over these caches etc). Very handy. But there is
> no Ignite streamer which would consider Cassandra to be a realtime / near
> realtime source.
>
> So, there is currently no integration done (correct me if I am wrong) which
> would have Cassandra as a _real time_ source.
>
> Looking into these diagrams, when you are able to load data from Cassandra
> from SSTables, would it be possible to continually fetch the offset in the CDC
> index file (these changes were done in 4.0 for the first time I think, ask
> Josh McKenzie about the details), read these mutations and send them via
> Sidecar to Spark?
>
> Currently, the only solution I know of which is doing realtime-ish streaming
> of mutations from CDC is the Debezium Cassandra connector, but it is pushing
> these mutations straight to Kafka only. I would love to have it in Spark first
> and then I can do whatever I want with that.
>
> (1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> (2) https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
> (3) https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
> (4) https://github.com/debezium/debezium-connector-cassandra
>
> ________________________________________
> From: Doug Rohrer <droh...@apple.com>
> Sent: Tuesday, April 11, 2023 0:37
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark
> Bulk Analytics
>
> I’ve updated the CEP with two overview diagrams of the interactions between
> Sidecar, Cassandra, and the Bulk Analytics library. Hope this helps folks
> better understand how things work, and thanks for the patience as it took a
> bit longer than expected for me to find the time for this.
>
> Doug
>
> On Apr 5, 2023, at 11:18 AM, Doug Rohrer <droh...@apple.com> wrote:
>
> Sorry for the delay in responding here - yes, we can add some diagrams to the
> CEP - I’ll try to get that done by end-of-week.
>
> Thanks,
>
> Doug
>
> On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>
> Maybe some data flow diagrams could be added to the CEP showing some example
> operations for read/write?
>
> On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>
> A lot of great discussions!
>
> On the Sidecar front, especially the role Sidecar plays in terms of this
> CEP, I feel there might be some confusion. Once the code is published, we
> should have clarity.
> Sidecar does not read SSTables nor do any coordination for analytics queries.
> It is local to the companion Cassandra instance.
> For bulk read, it takes snapshots and streams SSTables to Spark workers to
> read. For bulk write, it imports the SSTables uploaded from Spark workers.
> All commands are existing JMX/nodetool functionalities from Cassandra.
> Sidecar adds the HTTP interface to them. It might be an oversimplified
> description. The complex computation is performed in Spark clusters only.
>
> In the long run, Cassandra might evolve into a database that does both OLTP
> and OLAP. (Not what this thread aims for)
> At the current stage, Spark is very well suited for analytic purposes.
>
> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
> I disagree with the first claim, as the process has all the information it
> chooses to utilise about which resources it’s using and what it’s using those
> resources for.
>
> The inability to isolate GC domains is something we cannot address, but also
> probably not a problem if we were doing everything with memory management as
> well as we could be.
>
> But, not worth derailing this thread for. Today we do very little well on
> this front within the process, and a separate process is well justified given
> the state of play.
>
> On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:
>
> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> ...
>
> I think we might be underselling how valuable JVM isolation is,
> especially for analytics queries that are going to pass the entire
> dataset through heap somewhat constantly.
>
> Big +1 here. The JVM simply does not have significant granularity of control
> for resource utilization, but this is explicitly a feature of separate
> processes.
> Add in being able to separate GC domains and you can avoid a lot
> of noisy neighbor in-VM behavior for the disparate workloads.
>
> Cheers,
>
> Derek
>
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> +---------------------------------------------------------------+