[jira] [Updated] (CASSANDRA-16222) CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-19 Thread Dinesh Joshi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Joshi updated CASSANDRA-16222:
-
Source Control Link: 
https://github.com/apache/cassandra-analytics/commit/1633cd9c6c3d88d5c66825fab76a369266509f7e
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics
> 
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tool/external
>Reporter: James Berragan
>Assignee: Francisco Guerrero
>Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Analytics library and 
> associated Cassandra Sidecar endpoints.
>  
> Please see 
> [CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
>  for more details, motivation, and available source code links.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16222) CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-19 Thread Dinesh Joshi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Joshi updated CASSANDRA-16222:
-
Status: Ready to Commit  (was: Review In Progress)

+1

> CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics
> 
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tool/external
>Reporter: James Berragan
>Assignee: Francisco Guerrero
>Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Analytics library and 
> associated Cassandra Sidecar endpoints.
>  
> Please see 
> [CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
>  for more details, motivation, and available source code links.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16222) CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-18 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANDRA-16222:
--
Reviewers: Dinesh Joshi, Yifan Cai  (was: Yifan Cai)

> CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics
> 
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tool/external
>Reporter: James Berragan
>Assignee: Francisco Guerrero
>Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Analytics library and 
> associated Cassandra Sidecar endpoints.
>  
> Please see 
> [CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
>  for more details, motivation, and available source code links.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16222) CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-18 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANDRA-16222:
--
Reviewers: Yifan Cai
   Status: Review In Progress  (was: Patch Available)

> CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics
> 
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tool/external
>Reporter: James Berragan
>Assignee: Francisco Guerrero
>Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Analytics library and 
> associated Cassandra Sidecar endpoints.
>  
> Please see 
> [CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
>  for more details, motivation, and available source code links.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16222) CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-12 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRA-16222:
---
Test and Documentation Plan: Added unit tests and integration tests.
 Status: Patch Available  (was: In Progress)

Sidecar Patch:
PR: https://github.com/apache/cassandra-sidecar/pull/45
CI: https://app.circleci.com/pipelines/github/frankgh/cassandra-sidecar/135

Analytics Patch:
https://github.com/frankgh/cassandra-analytics

> CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics
> 
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tool/external
>Reporter: James Berragan
>Assignee: Francisco Guerrero
>Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Analytics library and 
> associated Cassandra Sidecar endpoints.
>  
> Please see 
> [CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
>  for more details, motivation, and available source code links.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16222) CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Doug Rohrer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Rohrer updated CASSANDRA-16222:

Description: 
*Description:*
This ticket introduces the Spark-Cassandra Bulk Analytics library and 
associated Cassandra Sidecar endpoints.

 

Please see 
[CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
 for more details, motivation, and available source code links.

  was:
*Description:*
 This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able 
to read and compact Cassandra raw sstables into SparkSQL along the principles 
of streaming compaction. The full code is attached as a patch file and will be 
submitted to a GitHub repo.

*Motivation*
 For analytics or "select *" use cases at scale, the performance is 
prohibitively expensive to read via the normal CQL read path - either using the 
Java driver or the Open Source Spark connector. By reading the raw sstables, 
the bulk reader is able to read with near-zero impact to a production cluster 
at speeds many orders of magnitudes faster than alternatives. We have seen very 
good performance results, exporting a 32TB table (~46bn CQL rows) to HDFS in 
Parquet format in around 1h10m; a 20x reduction compared to previous solutions. 
By reading from multiple replicas and ‘compacting’ duplicate data together, it 
can achieve consistency at a user defined level (i.e. ONE, TWO, LOCAL_QUORUM 
etc). 

*Overview*
 This library provides the core code for reading a set of SSTables into 
SparkSQL through a DataLayer abstraction. The role of the DataLayer is to:
 * return a SchemaStruct, mapping the Cassandra CQL table schema to the 
SparkSQL schema.
 * a list of sstables available for reading.
 * a method to open an InputStream for any file component of an sstable (e.g. 
data, compression, summary etc).

The PartitionedDataLayer abstraction builds on the DataLayer interface for 
partitioning Spark workers across a Cassandra token ring - allowing the Spark 
job to scale linearly - and reading from sufficient Cassandra replicas to 
achieve a user-specified consistency level.

A simple example LocalDataLayer implementation is included for reading from a 
local file system. Users of the library can build their own implementations to 
read from wherever they wish e.g. reading from a backup in a cloud storage 
system, or reading from the snapshot directory of a live Cassandra cluster.
  
 At the core, the bulk reader uses the Apache Cassandra CompactionIterator to 
perform the streaming compaction. As it iterates through the CompactionIterator 
it deserializes the ByteBuffers, converts into the appropriate SparkSQL data 
type and pivots each cell into a SparkSQL row.
  
 Supporting this library is a robust property-based test framework for writing 
Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, and 
reading back through SparkSQL to verify the library achieves both consistency 
and correctness.

Summary: CEP-28: Reading and Writing Cassandra Data with Spark Bulk 
Analytics  (was: Spark-Cassandra Bulk Reader)

> CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics
> 
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tool/external
>Reporter: James Berragan
>Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Analytics library and 
> associated Cassandra Sidecar endpoints.
>  
> Please see 
> [CEP-28|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics]
>  for more details, motivation, and available source code links.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org