Doug,

Thanks for the diagrams; they're really helpful.

Do you think there might be some extension to this CEP (it does not necessarily 
need to be included from the very beginning; just food for thought at this 
point) which would read data from the commit log / CDC?

The main motivation behind this is that, when one looks at what is currently 
possible, Cassandra often exists only as a sink when it comes to streaming. For 
example, with Spark we can use the Kafka connector (1): data arrives in Kafka, 
is streamed to Spark as RDDs, and Spark saves it to Cassandra via the Spark 
Cassandra Connector. Such a transformation / pipeline is indeed possible.

We also have the Cassandra + Ignite integration (2, 3), where Ignite acts as an 
in-memory caching layer on top of Cassandra. This enables users to run 
transformations over IgniteRDD and queries which are not normally possible 
(e.g. SQL joins in Ignite over these caches). Very handy. But there is no 
Ignite streamer which would treat Cassandra as a realtime / near-realtime 
source.

So, there is currently no integration (correct me if I am wrong) which would 
have Cassandra as a _real time_ source.

Looking at these diagrams: since you are able to load data from Cassandra's 
SSTables, would it be possible to continually fetch the offset in the CDC index 
file (these changes first landed in 4.0, I think; ask Josh McKenzie about the 
details), read the corresponding mutations, and send them via Sidecar to Spark?

Currently, the only solution I know of which does realtime-ish streaming of 
mutations from CDC is the Debezium Cassandra connector (4), but it pushes these 
mutations straight to Kafka only. I would love to have them in Spark first, and 
then I can do whatever I want with them.

(1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
(2) https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
(3) https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
(4) https://github.com/debezium/debezium-connector-cassandra

________________________________________
From: Doug Rohrer <droh...@apple.com>
Sent: Tuesday, April 11, 2023 0:37
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
Bulk Analytics

I’ve updated the CEP with two overview diagrams of the interactions between 
Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
better understand how things work, and thanks for the patience as it took a bit 
longer than expected for me to find the time for this.

Doug

On Apr 5, 2023, at 11:18 AM, Doug Rohrer <droh...@apple.com> wrote:

Sorry for the delay in responding here - yes, we can add some diagrams to the 
CEP - I’ll try to get that done by end-of-week.

Thanks,

Doug

On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

Maybe some data flow diagrams could be added to the cep showing some example 
operations for read/write?

On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:


A lot of great discussions!

On the Sidecar front, especially regarding the role Sidecar plays in this CEP, 
I feel there might be some confusion. Once the code is published, we should 
have clarity.
Sidecar does not read sstables, nor does it do any coordination for analytics 
queries. It is local to the companion Cassandra instance. For bulk read, it 
takes snapshots and streams sstables to Spark workers to read. For bulk write, 
it imports the sstables uploaded from Spark workers. All commands are existing 
JMX/nodetool functionalities from Cassandra; Sidecar adds the HTTP interface to 
them. That might be an oversimplified description. The complex computation is 
performed in Spark clusters only.
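To illustrate the delegation pattern described above (a purely hypothetical sketch; the routes and parameter names are invented for illustration and are not the real Sidecar API), the mapping is essentially HTTP route to existing nodetool command:

```python
# Hypothetical sketch only: Sidecar-style HTTP endpoints delegating to
# existing nodetool operations. The routes and params dict are invented;
# the underlying `nodetool snapshot` and `nodetool import` commands are
# real Cassandra tools (import exists since 4.0).
def nodetool_command_for(route, params):
    if route == "/snapshots":
        # Bulk read: create a snapshot that Spark workers can then stream.
        return ["nodetool", "snapshot", "-t", params["tag"], params["keyspace"]]
    if route == "/sstables/import":
        # Bulk write: import sstables previously uploaded by Spark workers.
        return ["nodetool", "import", params["keyspace"], params["table"], params["dir"]]
    raise ValueError("unknown route: " + route)
```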

In the long run, Cassandra might evolve into a database that does both OLTP and 
OLAP (not what this thread aims for). At the current stage, Spark is very well 
suited for analytics purposes.

On Tue, Mar 28, 2023 at 9:06 AM Benedict 
<bened...@apache.org<mailto:bened...@apache.org>> wrote:
I disagree with the first claim, as the process has all the information it 
chooses to utilise about which resources it’s using and what it’s using those 
resources for.

The inability to isolate GC domains is something we cannot address, but also 
probably not a problem if we were doing everything with memory management as 
well as we could be.

But this is not worth derailing the thread for. Today we do very little well on 
this front within the process, and a separate process is well justified given 
the state of play.

On 28 Mar 2023, at 16:38, Derek Chen-Becker 
<de...@chen-becker.org<mailto:de...@chen-becker.org>> wrote:



On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch 
<joe.e.ly...@gmail.com<mailto:joe.e.ly...@gmail.com>> wrote:
...

I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that are going to pass the entire
dataset through heap somewhat constantly.

Big +1 here. The JVM simply does not have significant granularity of control 
for resource utilization, but this is explicitly a feature of separate 
processes. Add in being able to separate GC domains and you can avoid a lot of 
noisy neighbor in-VM behavior for the disparate workloads.

Cheers,

Derek


--
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+


