[
https://issues.apache.org/jira/browse/SPARK-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-16616:
---------------------------------
Labels: bulk-closed (was: )
> Allow Catalyst to take Advantage of Hash Partitioned DataSources
> ----------------------------------------------------------------
>
> Key: SPARK-16616
> URL: https://issues.apache.org/jira/browse/SPARK-16616
> Project: Spark
> Issue Type: New Feature
> Reporter: Russell Spitzer
> Priority: Major
> Labels: bulk-closed
>
> Many distributed databases provide hash-partitioned data (in contrast to data
> partitioned on the values of a specific column), and this information can be
> used to greatly enhance Spark performance.
> For example:
> Data within Cassandra is distributed based on a hash of the "Partition Key",
> which is a set of columns. This means all rows read from the database that
> contain the same partition key will exist in the same Spark partition. When
> these rows are joined with themselves or aggregated on these partition key
> columns, there is no need to do a shuffle.
> {code}
> CREATE TABLE purchases (UserID int, purchase int, amount int,
>   PRIMARY KEY (UserID, purchase))
> {code}
> Would internally (using the SparkCassandraConnector) produce an RDD that
> looks like:
> {code}
> Spark Partition 1 : (1, 1, 5), (1, 2, 6), (432, 1, 10) ....
> Spark Partition 2 : (2, 1, 4), (2, 2, 5), (700, 1, 1) ...
> {code}
> Here, all values for {{UserID}} 1 are in the first Spark partition, but the
> values contained within Spark Partition 1 do not cover a contiguous range of
> {{UserID}} values.
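> As a concrete illustration, below is a hedged sketch (in Scala) of the kinds
> of queries that could skip the shuffle if Catalyst knew about this existing
> hash partitioning. The keyspace/table names are illustrative and not part of
> this proposal:
> {code}
> // Read the table through the Spark Cassandra Connector's DataFrame source.
> val purchases = spark.read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("keyspace" -> "store", "table" -> "purchases"))
>   .load()
>
> // Aggregation on the partition key column: every row for a given UserID is
> // already co-located in one Spark partition, yet the physical plan still
> // inserts an Exchange because Catalyst does not know about the partitioning.
> purchases.groupBy("UserID").sum("amount")
>
> // Self-join on the partition key column: same situation.
> purchases.as("a").join(purchases.as("b"), "UserID")
> {code}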
> Like with normal RDDs, it would be nice if a data source could expose a
> partitioning function that, given the key value, indicates which partition a
> row belongs to. Catalyst could then use this information to avoid shuffles in
> aggregates and joins.
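> A minimal sketch of what such a partitioning function could look like,
> assuming an RDD-style {{Partitioner}} is the shape of the API. The class name
> and the hash used here are hypothetical; the real mapping would have to
> follow Cassandra's token ring (Murmur3 by default):
> {code}
> import org.apache.spark.Partitioner
>
> // Hypothetical partitioner mirroring how rows are already laid out across
> // Spark partitions: given a key value, answer "which Spark partition holds
> // the rows with this partition key?".
> class CassandraKeyPartitioner(override val numPartitions: Int) extends Partitioner {
>   override def getPartition(key: Any): Int = {
>     // Placeholder hash kept non-negative; a real implementation would map
>     // the key through the cluster's token ranges.
>     ((key.hashCode % numPartitions) + numPartitions) % numPartitions
>   }
> }
>
> // Two RDDs keyed by UserID that share this partitioner are co-partitioned,
> // so joining or reducing them by key needs no shuffle -- the same guarantee
> // this issue asks Catalyst to exploit when planning DataFrame queries.
> // val byUser = rows.keyBy(_.userId).partitionBy(new CassandraKeyPartitioner(n))
> // byUser.join(byUser)  // planned without a shuffle
> {code}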