Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

J. D. Jordan Tue, 11 Apr 2023 11:55:53 -0700

Thanks for those. They are very helpful.

I think the CEP needs to call out all of the classes/interfaces from the cassandra-all jar that the “Spark driver” is using.

Given this CEP is exposing “sstables as an external API” I would think all the interfaces and code associated with using those would need to be treated as user API now?

For example, the Spark driver is actually calling the compaction classes and using the internal C* objects to process the data. I don’t think any of those classes have previously been considered “public” in any way.

Is said Spark driver also being donated as part of the CEP? Or just the code to implement the interfaces in the Sidecar?

-Jeremiah

On Apr 10, 2023, at 5:37 PM, Doug Rohrer <[email protected]> wrote:

I’ve updated the CEP with two overview diagrams of the interactions between Sidecar, Cassandra, and the Bulk Analytics library. Hope this helps folks better understand how things work, and thanks for the patience as it took a bit longer than expected for me to find the time for this.

Doug

On Apr 5, 2023, at 11:18 AM, Doug Rohrer <[email protected]> wrote:

Sorry for the delay in responding here - yes, we can add some diagrams to the CEP - I’ll try to get that done by end-of-week.

Thanks,

Doug

On Mar 28, 2023, at 1:14 PM, J. D. Jordan <[email protected]> wrote:

Maybe some data flow diagrams could be added to the CEP showing some example operations for read/write?

On Mar 28, 2023, at 11:35 AM, Yifan Cai <[email protected]> wrote:

A lot of great discussions!

On the sidecar front, especially the role sidecar plays in terms of this CEP, I feel there might be some confusion. Once the code is published, we should have clarity.
Sidecar neither reads sstables nor does any coordination for analytics queries. It is local to the companion Cassandra instance. For bulk read, it takes snapshots and streams sstables to Spark workers to read. For bulk write, it imports the sstables uploaded from Spark workers. All commands are existing JMX/nodetool functionality from Cassandra; Sidecar adds the HTTP interface to them. This might be an oversimplified description. The complex computation is performed in Spark clusters only.

In the long run, Cassandra might evolve into a database that does both OLTP and OLAP. (Not what this thread aims for)
At the current stage, Spark is well suited for analytic purposes.

On Tue, Mar 28, 2023 at 9:06 AM Benedict <[email protected]> wrote:
I disagree with the first claim, as the process has all the information it chooses to utilise about which resources it’s using and what it’s using those resources for.

The inability to isolate GC domains is something we cannot address, but also probably not a problem if we were doing everything with memory management as well as we could be.

But that is not worth derailing this thread for. Today we do very little well on this front within the process, and a separate process is well justified given the state of play.

On 28 Mar 2023, at 16:38, Derek Chen-Becker <[email protected]> wrote:

On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <[email protected]> wrote:
...

I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that are going to pass the entire
dataset through heap somewhat constantly.

Big +1 here. The JVM simply does not have significant granularity of control for resource utilization, but this is explicitly a feature of separate processes. Add in being able to separate GC domains and you can avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.

Cheers,

Derek

--
+---------------------------------------------------------------+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
+---------------------------------------------------------------+
