[
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219896#comment-17219896
]
DuyHai Doan commented on CASSANDRA-16222:
-----------------------------------------
When first reading this JIRA, it sounded very great !
Then I was reading the doc on the Github repo and stumbled on this sentence
{quote}The PartitionedDataLayer currently assumes 1 token per Cassandra
instance so *will have unexpected behavior with virtual nodes.*
{quote}
The devils are in the details, this limitation makes this great Spark library
unusable for most Cassandra users out there ...
> Spark-Cassandra Bulk Reader
> ---------------------------
>
> Key: CASSANDRA-16222
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
> Project: Cassandra
> Issue Type: New Feature
> Components: Tool/external
> Reporter: James Berragan
> Priority: Normal
> Fix For: NA
>
> Attachments: sparkbulkreader.patch
>
>
> *Description:*
> This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able
> to read and compact Cassandra raw sstables into SparkSQL along the principles
> of streaming compaction. The full code is attached as a patch file and will
> be submitted to a GitHub repo.
> *Motivation*
> For analytics or "select *" use cases at scale, the performance is
> prohibitively expensive to read via the normal CQL read path - either using
> the Java driver or the Open Source Spark connector. By reading the raw
> sstables, the bulk reader is able to read with near-zero impact to a
> production cluster at speeds many orders of magnitudes faster than
> alternatives. We have seen very good performance results, exporting a 32TB
> table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m; a 20x
> reduction compared to previous solutions. By reading from multiple replicas
> and ‘compacting’ duplicate data together, it can achieve consistency at a
> user defined level (i.e. ONE, TWO, LOCAL_QUORUM etc).
> *Overview*
> This library provides the core code for reading a set of SSTables into
> SparkSQL through a DataLayer abstraction. The role of the DataLayer is to:
> * return a SchemaStruct, mapping the Cassandra CQL table schema to the
> SparkSQL schema.
> * a list of sstables available for reading.
> * a method to open an InputStream for any file component of an sstable (e.g.
> data, compression, summary etc).
> The PartitionedDataLayer abstraction builds on the DataLayer interface for
> partitioning Spark workers across a Cassandra token ring - allowing the Spark
> job to scale linearly - and reading from sufficient Cassandra replicas to
> achieve a user-specified consistency level.
> A simple example LocalDataLayer implementation is included for reading from a
> local file system. Users of the library can build their own implementations
> to read from wherever they wish e.g. reading from a backup in a cloud storage
> system, or reading from the snapshot directory of a live Cassandra cluster.
>
> At the core, the bulk reader uses the Apache Cassandra CompactionIterator to
> perform the streaming compaction. As it iterates through the
> CompactionIterator it deserializes the ByteBuffers, converts into the
> appropriate SparkSQL data type and pivots each cell into a SparkSQL row.
>
> Supporting this library is a robust property-based test framework for
> writing Cassandra sstables with arbitrary schemas using the CQLSSTableWriter,
> and reading back through SparkSQL to verify the library achieves both
> consistency and correctness.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]