[jira] [Updated] (SPARK-16616) Allow Catalyst to take Advantage of Hash Partitioned DataSources

Hyukjin Kwon (JIRA) Mon, 20 May 2019 21:48:07 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon updated SPARK-16616:
---------------------------------
    Labels: bulk-closed  (was: )

> Allow Catalyst to take Advantage of Hash Partitioned DataSources
> ----------------------------------------------------------------
>
>                 Key: SPARK-16616
>                 URL: https://issues.apache.org/jira/browse/SPARK-16616
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Russell Spitzer
>            Priority: Major
>              Labels: bulk-closed
>
> Many distributed databases provide hash-partitioned data (in contrast to data 
> partitioned on a specific column), and this information can be used to greatly 
> improve Spark performance by avoiding unnecessary shuffles. 
> For example: 
> Data within Cassandra is distributed based on a hash of the "Partition Key", 
> which is a set of columns. This means that all rows read from the database 
> that share the same "Partition Key" will exist in the same Spark partition. 
> When these rows are joined with themselves or aggregated on the "Partition 
> Key" columns, there is no need to do a shuffle.
> {code}
> -- table name is illustrative
> CREATE TABLE purchases (UserID int, purchase int, amount int, 
> PRIMARY KEY (UserID, purchase))
> {code}
> This would internally (using the Spark Cassandra Connector) produce an RDD 
> that looks like:
> {code}
> Spark Partition 1 :  (1, 1, 5), (1, 2, 6), (432, 1, 10) .... 
> Spark Partition 2 :  (2, 1, 4), (2, 2, 5), (700, 1, 1) ...
> {code}
> Here, all values for {{UserID}} 1 are in the first partition, but the values 
> contained within Spark Partition 1 do not cover a contiguous range of values 
> for {{UserID}}.
> As with normal RDDs, it would be nice if we could expose a partitioning 
> function that, given the key value, indicates which partition the row would 
> be in. This information could also be used in aggregates and joins. 
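
The idea in the description can be sketched in plain Python, without Spark or
Cassandra. This is only an illustration under stated assumptions: the names
partition_for, partition_rows and sum_amount_per_user are hypothetical, the
partition count of 2 is arbitrary, and crc32 stands in for the database's real
hash (Cassandra's default partitioner uses Murmur3 tokens). It shows that when
a key-to-partition function is known, a per-key aggregate can be computed
inside each partition and the results combined without moving rows.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 2  # assumption: arbitrary partition count for the sketch


def partition_for(user_id, num_partitions=NUM_PARTITIONS):
    """Hypothetical partitioning function: partition key -> partition index.

    crc32 is a deterministic stand-in for the data source's own hash.
    """
    return crc32(str(user_id).encode("utf-8")) % num_partitions


def partition_rows(rows, num_partitions=NUM_PARTITIONS):
    """Place each (UserID, purchase, amount) row by the hash of its UserID."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[0], num_partitions)].append(row)
    return partitions


def sum_amount_per_user(partition):
    """Aggregate within one partition only; no rows leave the partition."""
    totals = defaultdict(int)
    for user_id, _purchase, amount in partition:
        totals[user_id] += amount
    return dict(totals)


# Rows taken from the example in the issue description.
rows = [(1, 1, 5), (1, 2, 6), (432, 1, 10), (2, 1, 4), (2, 2, 5), (700, 1, 1)]

partitions = partition_rows(rows)
local_results = [sum_amount_per_user(p) for p in partitions]

# Every UserID lives in exactly one partition, so the per-partition results
# have disjoint keys and combining them is a plain union, not a shuffle.
merged = {}
for result in local_results:
    assert merged.keys().isdisjoint(result.keys())
    merged.update(result)
```

A planner that knew partition_for (or its real equivalent) could recognise
that a GROUP BY or self-join on UserID already satisfies the required data
distribution and skip the exchange step.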



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

