Hi everyone,

Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
We’d like to propose this CEP for adoption by the community.

Teams using Cassandra commonly find themselves looking for a way to work with large amounts of data for analytics workloads. However, Cassandra’s standard APIs aren’t suited for large-scale data egress/ingest, as the native read/write paths weren’t designed for bulk analytics. This CEP addresses exactly that need: it enables custom Spark (or similar) applications to read or write large amounts of Cassandra data at line rates by accessing the persistent storage of nodes in the cluster via the Cassandra Sidecar.

Specifically, the CEP proposes new APIs in the Cassandra Sidecar along with a companion library that integrates deeply with Apache Spark, allowing users to bulk import or export data from a running Cassandra cluster with minimal to no impact on read/write traffic.

We will shortly publish a branch with code accompanying this CEP to help readers understand it better.

As a reminder, please keep the discussion here on the dev list rather than in the wiki, as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan