Hi Stefan, CDC is something we are also thinking about, and it is worthy of a 
separate discussion. We have tested Spark Streaming for CDC and I hope we can 
bolt it on in the future, but streaming technologies also come with more 
caveats and nuances (we have found it beneficial with CDC to store a small 
amount of state, which is at odds with Spark’s more stateless architecture). 
From that perspective I think it makes sense to keep CDC technology-agnostic 
and let the user plug into whichever system they want (Spark Streaming, Flink, 
custom, etc.).

James.

> On Apr 11, 2023, at 1:19 PM, Miklosovic, Stefan 
> <stefan.mikloso...@netapp.com> wrote:
> 
> Doug,
> 
> thanks for the diagrams, really helpful.
> 
> Do you think there might be some extension to this CEP (it does not 
> necessarily need to be included from the very beginning, just food for 
> thought at this point) which would read data from the commit log / CDC?
> 
> The main motivation behind this is that when one looks around in terms of 
> what is currently possible with Spark, Cassandra often exists as a sink only 
> when it comes to streaming. For example, we can use the Kafka connector (1) 
> so that data comes into Kafka, is streamed to Spark as RDDs, and Spark saves 
> it to Cassandra via the Spark Cassandra Connector. Such a transformation / 
> pipeline is indeed possible.
> 
> We also have the Cassandra + Ignite integration (2, 3), so Ignite can act as 
> an in-memory caching layer on top of Cassandra, which enables users to do 
> transformations over IgniteRDD and run queries which are not normally 
> possible (e.g. SQL joins in Ignite over these caches). Very handy. But there 
> is no Ignite streamer which would consider Cassandra to be a realtime / near 
> realtime source.
> 
> So, there is currently no integration (correct me if I am wrong) which 
> would have Cassandra as a _real time_ source.
> 
> Looking at these diagrams, since you are able to load data from Cassandra 
> SSTables, would it be possible to continually fetch the offset in the CDC 
> index file (these changes were done in 4.0 for the first time I think, ask 
> Josh McKenzie about the details), read these mutations and send them via 
> Sidecar to Spark?
> 
> Currently, the only solution I know of which does realtime-ish streaming of 
> mutations from CDC is the Debezium Cassandra connector (4), but it pushes 
> these mutations straight to Kafka only. I would love to have them in Spark 
> first and then I can do whatever I want with them.
> 
> (1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> (2) https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
> (3) https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
> (4) https://github.com/debezium/debezium-connector-cassandra
> 
> ________________________________________
> From: Doug Rohrer <droh...@apple.com>
> Sent: Tuesday, April 11, 2023 0:37
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
> Bulk Analytics
> 
> I’ve updated the CEP with two overview diagrams of the interactions between 
> Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
> better understand how things work, and thanks for the patience as it took a 
> bit longer than expected for me to find the time for this.
> 
> Doug
> 
> On Apr 5, 2023, at 11:18 AM, Doug Rohrer <droh...@apple.com> wrote:
> 
> Sorry for the delay in responding here - yes, we can add some diagrams to the 
> CEP - I’ll try to get that done by end-of-week.
> 
> Thanks,
> 
> Doug
> 
> On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> 
> Maybe some data flow diagrams could be added to the CEP showing some example 
> operations for read/write?
> 
> On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:
> 
> 
> A lot of great discussions!
> 
> On the sidecar front, especially the role Sidecar plays in terms of this 
> CEP, I feel there might be some confusion. Once the code is published, we 
> should have clarity.
> Sidecar does not read sstables, nor does it do any coordination for analytics 
> queries. It is local to the companion Cassandra instance. For bulk read, it 
> takes snapshots and streams sstables to Spark workers to read. For bulk 
> write, it imports the sstables uploaded from Spark workers. All commands are 
> existing jmx/nodetool functionality from Cassandra; Sidecar adds the HTTP 
> interface to them. This might be an oversimplified description. The complex 
> computation is performed in Spark clusters only.
> 
> In the long run, Cassandra might evolve into a database that does both OLTP 
> and OLAP. (Not what this thread aims for.)
> At the current stage, Spark is very well suited for analytic purposes.
> 
> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
> I disagree with the first claim, as the process has all the information it 
> chooses to utilise about which resources it’s using and what it’s using those 
> resources for.
> 
> The inability to isolate GC domains is something we cannot address, but also 
> probably not a problem if we were doing everything with memory management as 
> well as we could be.
> 
> But that is not worth derailing this thread for. Today we do very little 
> well on this front within the process, and a separate process is well 
> justified given the state of play.
> 
> On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:
> 
> 
> 
> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> ...
> 
> I think we might be underselling how valuable JVM isolation is,
> especially for analytics queries that are going to pass the entire
> dataset through heap somewhat constantly.
> 
> Big +1 here. The JVM simply does not have significant granularity of control 
> for resource utilization, but this is explicitly a feature of separate 
> processes. Add in being able to separate GC domains and you can avoid a lot 
> of noisy neighbor in-VM behavior for the disparate workloads.
> 
> Cheers,
> 
> Derek
> 
> 
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
