Complex predicates on non-partition keys naturally require pulling the entire 
data set into the Spark DataFrame to perform the query. We have some 
optimizations around column filtering and partition-key predicates, utilizing 
the Filter.db/Summary.db/Index.db files so that we only read the data we need. 
We have chatted with Caleb about utilizing the SAI index files as well, but at 
present that is purely theoretical.
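
To make that concrete, here is a minimal Spark sketch of the difference. The 
data source format name, options, and column names below are placeholders for 
illustration, not the library's confirmed API; the point is only the pruning 
behavior described above.

import org.apache.spark.sql.SparkSession

object PredicateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bulk-read-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical format name for the bulk reader; substitute the real one.
    val df = spark.read
      .format("cassandra-bulk-analytics")
      .option("keyspace", "ks")
      .option("table", "events")
      .load()

    // Partition-key predicate: the reader can consult the Filter.db/
    // Summary.db/Index.db components to skip SSTable data that cannot match.
    df.filter(df("id") === "abc123").show()

    // Non-partition-key predicate: there is nothing to prune on, so every
    // SSTable is read into the DataFrame and the filter is applied Spark-side.
    df.filter(df("value") > 42).show()

    spark.stop()
  }
}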

In terms of internals, beyond some util/serializer classes, the writer side 
depends on CQLSSTableWriter, and the reader uses SSTableSimpleIterator and 
CompactionIterator.
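
For the writer side, CQLSSTableWriter is a public Cassandra utility, so a 
minimal sketch is possible without touching anything private (Scala here for 
consistency with the sketch above; the keyspace, table, and paths are 
illustrative):

import org.apache.cassandra.io.sstable.CQLSSTableWriter

object WriterSketch {
  def main(args: Array[String]): Unit = {
    // The writer is driven entirely by a schema and an INSERT statement; it
    // produces SSTable files on local disk without a running cluster.
    val writer = CQLSSTableWriter.builder()
      .inDirectory("/tmp/sstables/ks/events")
      .forTable("CREATE TABLE ks.events (id text PRIMARY KEY, value int)")
      .using("INSERT INTO ks.events (id, value) VALUES (?, ?)")
      .build()

    try {
      // Values are bound in the order of the INSERT's placeholders.
      writer.addRow("abc123", Int.box(42))
    } finally {
      writer.close()
    }
  }
}

As proposed in the CEP, the resulting SSTable files would then be shipped to 
the cluster via the Cassandra Sidecar rather than through the normal write 
path.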

James.

> On Mar 27, 2023, at 3:06 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> 
> Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds 
> like you've been using this for some time.  I understand from the rejected 
> alternatives that the Spark Cassandra Connector was slower because it goes 
> through the read and write path for C* rather than this backdoor mechanism.  
> In your experience using this, under what circumstances have you seen that 
> this tool is not a good fit for analytics - such as complex predicates?  The 
> challenge with the Spark Cassandra Connector, and previously the Hadoop 
> integration, was that they had to do full table scans even to get small 
> amounts of data.  It sounds like this is similar in that it has to do a full 
> table scan, but with the advantage of being faster and putting less load on 
> the cluster.  In 
> other words, I'm asking if this has been a replacement for the Spark 
> Cassandra Connector or if there are cases in your work where SCC is a better 
> fit.
> 
> Also to Benjamin's point in the comments on the CEP itself, how coupled is 
> this to internals?  Are there going to be higher-level APIs, or is it going 
> to call internal storage classes directly?
> 
> Thanks!
> 
> Jeremy
> 
> 
>> On Mar 23, 2023, at 12:33 PM, Doug Rohrer <droh...@apple.com> wrote:
>> 
>> Hi everyone,
>> 
>> Wiki: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>> 
>> We’d like to propose this CEP for adoption by the community.
>> 
>> Teams using Cassandra commonly find themselves looking for a way to 
>> interact with large amounts of data for analytics workloads. However, 
>> Cassandra’s standard APIs aren’t well suited to large-scale data 
>> egress/ingest, as the native read/write paths weren’t designed for bulk 
>> analytics.
>> 
>> We’re proposing this CEP for this exact purpose. It enables the 
>> implementation of custom Spark (or similar) applications that can either 
>> read or write large amounts of Cassandra data at line rates, by accessing 
>> the persistent storage of nodes in the cluster via the Cassandra Sidecar.
>> 
>> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
>> that integrates deeply with Apache Spark, allowing users to bulk import or 
>> export data from a running Cassandra cluster with minimal to no impact on 
>> read/write traffic.
>> 
>> We will shortly publish a branch with code that will accompany this CEP to 
>> help readers understand it better.
>> 
>> As a reminder, please keep the discussion here on the dev list vs. in the 
>> wiki, as we’ve found it easier to manage via email.
>> 
>> Sincerely,
>> 
>> Doug Rohrer & James Berragan
> 
