Hi everyone,

Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
We’d like to propose this CEP for adoption by the community.

Teams using Cassandra commonly find themselves looking for a way to work with large amounts of data for analytics workloads. However, Cassandra’s standard APIs aren’t suited for large-scale data egress/ingest, as the native read/write paths weren’t designed for bulk analytics. This CEP addresses exactly that need: it enables custom Spark (or similar) applications to read or write large amounts of Cassandra data at line rates by accessing the persistent storage of nodes in the cluster via the Cassandra Sidecar.

Specifically, the CEP proposes new APIs in the Cassandra Sidecar along with a companion library that integrates deeply with Apache Spark, allowing users to bulk import or export data from a running Cassandra cluster with minimal to no impact on read/write traffic.

We will shortly publish a branch with code accompanying this CEP to help readers understand it better.

As a reminder, please keep the discussion here on the dev list rather than in the wiki, as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan