[ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219221#comment-17219221
 ] 

Brandon Williams commented on CASSANDRA-16222:
----------------------------------------------

First, thanks for this!  This looks fantastic and will be extremely useful to 
many users.

I tried taking a look but had issues applying the patch:

{code}
$ git apply sparkbulkreader.patch
sparkbulkreader.patch:353: trailing whitespace.
By reading the raw SSTables directly, the Cassandra-Spark Bulk Reader enables 
efficient and fast massive-scale analytics queries without impacting the 
performance of a production Cassandra cluster. 
sparkbulkreader.patch:376: trailing whitespace.
  
sparkbulkreader.patch:380: trailing whitespace.
The project is robustly tested using a bespoke property-based testing system 
that uses [QuickTheories](https://github.com/quicktheories/QuickTheories) to 
enumerate many Cassandra CQL schemas, write random data using the Cassandra 
[CQLSSTableWriter](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java),
 read the SSTables into SparkSQL and verify the resulting SparkSQL rows match 
the expected.  
error: corrupt patch at line 21241
{code}

Could you resend the patch or post a repo on GitHub?

> Spark-Cassandra Bulk Reader
> ---------------------------
>
>                 Key: CASSANDRA-16222
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: James Berragan
>            Priority: Normal
>         Attachments: sparkbulkreader.patch
>
>
> *Description:*
>  This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able 
> to read and compact Cassandra raw sstables into SparkSQL along the principles 
> of streaming compaction. The full code is attached as a patch file and will 
> be submitted to a GitHub repo.
> *Motivation*
>  For analytics or "select *" use cases at scale, reading via the normal CQL 
> read path - either using the Java driver or the Open Source Spark connector - 
> is prohibitively expensive. By reading the raw sstables, the bulk reader is 
> able to read with near-zero impact on a production cluster at speeds many 
> orders of magnitude faster than alternatives. We have seen very good 
> performance results, exporting a 32TB 
> table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m; a 20x 
> reduction compared to previous solutions. By reading from multiple replicas 
> and ‘compacting’ duplicate data together, it can achieve consistency at a 
> user-defined level (e.g. ONE, TWO, LOCAL_QUORUM etc). 
> *Overview*
>  This library provides the core code for reading a set of SSTables into 
> SparkSQL through a DataLayer abstraction. The role of the DataLayer is to:
>  * return a SchemaStruct, mapping the Cassandra CQL table schema to the 
> SparkSQL schema.
>  * return a list of sstables available for reading.
>  * provide a method to open an InputStream for any file component of an 
> sstable (e.g. data, compression, summary etc).
> The PartitionedDataLayer abstraction builds on the DataLayer interface for 
> partitioning Spark workers across a Cassandra token ring - allowing the Spark 
> job to scale linearly - and reading from sufficient Cassandra replicas to 
> achieve a user-specified consistency level.
> A simple example LocalDataLayer implementation is included for reading from a 
> local file system. Users of the library can build their own implementations 
> to read from wherever they wish e.g. reading from a backup in a cloud storage 
> system, or reading from the snapshot directory of a live Cassandra cluster.
>   
>  At the core, the bulk reader uses the Apache Cassandra CompactionIterator to 
> perform the streaming compaction. As it iterates through the 
> CompactionIterator it deserializes the ByteBuffers, converts them into the 
> appropriate SparkSQL data types and pivots each cell into a SparkSQL row.
>   
>  Supporting this library is a robust property-based test framework for 
> writing Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, 
> and reading back through SparkSQL to verify the library achieves both 
> consistency and correctness.
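
For readers unfamiliar with streaming compaction, the idea in the description above can be illustrated with a minimal Python sketch. This is not code from the attached patch, which relies on Cassandra's CompactionIterator in Java. The names Cell, compact and pivot are hypothetical, and the sketch assumes that the cell with the newest write timestamp wins; it ignores tombstones, TTLs and consistency levels.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator, NamedTuple


class Cell(NamedTuple):
    """Hypothetical stand-in for one cell read from an sstable."""
    key: str        # partition key
    column: str     # column name
    value: str      # deserialized value
    timestamp: int  # write timestamp; the newest write wins


def compact(*sources: Iterable[Cell]) -> Iterator[Cell]:
    """Lazily merge streams sorted by (key, column), keeping the newest duplicate."""
    merged = heapq.merge(*sources, key=lambda c: (c.key, c.column))
    for _, duplicates in groupby(merged, key=lambda c: (c.key, c.column)):
        yield max(duplicates, key=lambda c: c.timestamp)


def pivot(cells: Iterable[Cell]) -> Iterator[dict]:
    """Pivot the reconciled cells of each partition key into one row dict."""
    for key, group in groupby(cells, key=lambda c: c.key):
        row = {"key": key}
        row.update({c.column: c.value for c in group})
        yield row


# Two replicas hold overlapping data; each stream is sorted by (key, column).
replica_a = [Cell("k1", "name", "old", 1), Cell("k2", "name", "bob", 5)]
replica_b = [Cell("k1", "city", "Oslo", 3), Cell("k1", "name", "new", 2)]
rows = list(pivot(compact(replica_a, replica_b)))
```

Because heapq.merge and groupby are both lazy, only one group of duplicate cells is held in memory at a time, which is the property that lets this style of merge stream over inputs far larger than memory.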



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

