James Berragan created CASSANDRA-16222:
------------------------------------------
Summary: Spark-Cassandra Bulk Reader
Key: CASSANDRA-16222
URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
Project: Cassandra
Issue Type: New Feature
Reporter: James Berragan
Attachments: sparkbulkreader.patch
*Description:*
This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able to
read and compact raw Cassandra sstables into SparkSQL, following the principles
of streaming compaction. The full code is attached as a patch file and will be
submitted to a GitHub repo.
*Motivation*
For analytics or "select *" use cases at scale, reading via the normal CQL read
path - either using the Java driver or the open-source Spark connector - is
prohibitively expensive. By reading the raw sstables, the bulk reader is able to
read with near-zero impact on a production cluster, at speeds many orders of
magnitude faster than the alternatives. We have seen very good performance
results, exporting a 32TB table (~46bn CQL rows) to HDFS in Parquet format in
around 1h10m; a 20x reduction in runtime compared to previous solutions. By
reading from multiple replicas and ‘compacting’ duplicate data together, it can
achieve consistency at a user-defined level (e.g. ONE, TWO, LOCAL_QUORUM).
*Overview*
This library provides the core code for reading a set of sstables into SparkSQL
through a DataLayer abstraction. The role of the DataLayer is to:
* return a SchemaStruct, mapping the Cassandra CQL table schema to the
SparkSQL schema.
* list the sstables available for reading.
* open an InputStream over any file component of an sstable (e.g. data,
compression, summary etc).
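To make the contract concrete, below is a minimal sketch of what such an
interface could look like, assuming Java and SparkSQL's StructType; the names
and signatures are illustrative, not taken from the attached patch.

{code:java}
import java.io.InputStream;
import java.util.List;

import org.apache.spark.sql.types.StructType;

// Illustrative sketch only; names and signatures are hypothetical.
public interface DataLayer
{
    // File components that make up an sstable on disk.
    enum FileType { DATA, COMPRESSION_INFO, SUMMARY, INDEX, STATISTICS, FILTER }

    // Handle to a single sstable; implementations decide what it wraps
    // (a local path, a cloud-storage object, etc.).
    interface SSTable
    {
        // Open an InputStream over one file component of this sstable.
        InputStream open(FileType fileType);
    }

    // Map the Cassandra CQL table schema to the SparkSQL schema.
    StructType structType();

    // List the sstables available for reading.
    List<SSTable> sstables();
}
{code}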
The PartitionedDataLayer abstraction builds on the DataLayer interface for
partitioning Spark workers across a Cassandra token ring - allowing the Spark
job to scale linearly - and reading from sufficient Cassandra replicas to
achieve a user-specified consistency level.
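To illustrate the partitioning idea, the sketch below divides the Murmur3 token
range into contiguous sub-ranges, one per Spark partition. The class is
hypothetical; the real PartitionedDataLayer must additionally map each range to
enough replicas to satisfy the requested consistency level.

{code:java}
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split the Murmur3Partitioner token range evenly
// across Spark partitions so each task reads a disjoint slice of the ring.
public final class TokenRingSplitter
{
    private static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    private static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    public static List<BigInteger[]> split(int numSparkPartitions)
    {
        BigInteger width = MAX.subtract(MIN).divide(BigInteger.valueOf(numSparkPartitions));
        List<BigInteger[]> ranges = new ArrayList<>(numSparkPartitions);
        BigInteger start = MIN;
        for (int i = 0; i < numSparkPartitions; i++)
        {
            // Give the final range any remainder so the ring is fully covered.
            BigInteger end = (i == numSparkPartitions - 1) ? MAX : start.add(width);
            ranges.add(new BigInteger[]{ start, end });
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args)
    {
        for (BigInteger[] r : split(4))
            System.out.println(r[0] + " .. " + r[1]);
    }
}
{code}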
A simple example LocalDataLayer implementation is included for reading from a
local file system. Users of the library can build their own implementations to
read from wherever they wish, e.g. from a backup in a cloud storage system, or
from the snapshot directory of a live Cassandra cluster.
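Usage from Spark could then look something like the following; the data source
name and option keys here are hypothetical stand-ins for whatever the library
actually registers.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class LocalExample
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder()
                                         .master("local[*]")
                                         .getOrCreate();

        // "localsstablesource" and the option keys are illustrative only.
        Dataset<Row> df = spark.read()
                               .format("localsstablesource")
                               .option("keyspace", "ks")
                               .option("table", "tbl")
                               .option("dirs", "/path/to/snapshot")
                               .load();

        df.show();
    }
}
{code}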
At its core, the Bulk Reader uses the Apache Cassandra CompactionIterator to
perform the streaming compaction. As it iterates through the CompactionIterator,
it deserializes the ByteBuffers, converts them into the appropriate SparkSQL
data types, and pivots each cell into a SparkSQL row.
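The per-cell conversion step can be sketched as below, using Cassandra's real
AbstractType compose/decompose methods; the surrounding CompactionIterator
plumbing (partition and row iteration, tombstone purging) is elided, and the
mapping to the matching SparkSQL type would follow the compose() call.

{code:java}
import java.nio.ByteBuffer;

import org.apache.cassandra.db.marshal.AbstractType;
import org.apache.cassandra.db.marshal.Int32Type;
import org.apache.cassandra.db.marshal.UTF8Type;

// Simplified sketch of cell deserialization; iteration over the
// CompactionIterator itself is elided.
public final class CellConversionExample
{
    static Object deserialize(AbstractType<?> type, ByteBuffer raw)
    {
        // compose() turns the serialized bytes back into a Java value;
        // converting to the matching SparkSQL type would follow.
        return type.compose(raw.duplicate());
    }

    public static void main(String[] args)
    {
        ByteBuffer intBuf = Int32Type.instance.decompose(42);
        ByteBuffer strBuf = UTF8Type.instance.decompose("hello");
        System.out.println(deserialize(Int32Type.instance, intBuf)); // 42
        System.out.println(deserialize(UTF8Type.instance, strBuf));  // hello
    }
}
{code}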
Supporting this library is a robust property-based test framework for writing
Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, and
reading them back through SparkSQL to verify that the library achieves both
consistency and correctness.
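The write side of such a round-trip test might look like the following, using
Cassandra's CQLSSTableWriter API; the schema and rows here are fixed for
brevity, where a property-based test would generate them arbitrarily.

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public final class SSTableRoundTripExample
{
    public static void main(String[] args) throws IOException
    {
        String schema = "CREATE TABLE ks.tbl (pk int, ck int, value text, "
                      + "PRIMARY KEY (pk, ck))";
        String insert = "INSERT INTO ks.tbl (pk, ck, value) VALUES (?, ?, ?)";

        // The output directory must already exist.
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                                                  .inDirectory(new File("/tmp/ks/tbl"))
                                                  .forTable(schema)
                                                  .using(insert)
                                                  .build();
        try
        {
            for (int i = 0; i < 100; i++)
                writer.addRow(i, i, "row-" + i);
        }
        finally
        {
            writer.close();
        }
        // A test would now read /tmp/ks/tbl back through SparkSQL and
        // compare the rows against what was written.
    }
}
{code}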