[jira] [Resolved] (SPARK-11387) minimize shuffles during joins by using existing partitions and bundling messages

Hyukjin Kwon (JIRA) Mon, 20 May 2019 22:10:13 -0700


     [ https://issues.apache.org/jira/browse/SPARK-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon resolved SPARK-11387.
----------------------------------
    Resolution: Incomplete

> minimize shuffles during joins by using existing partitions and bundling 
> messages
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-11387
>                 URL: https://issues.apache.org/jira/browse/SPARK-11387
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Glenn Strycker
>            Priority: Major
>              Labels: bulk-closed
>
> Currently an RDD join in Spark requires repartitioning by the join key (for 
> large RDDs that cannot use broadcast).
> This performs poorly on highly skewed data, because every row containing a 
> given key ends up on the same node.
> Additionally, repartitioning is expensive, and the existing partitioning 
> scheme may have been optimized to minimize message passing.  For example, 
> perhaps an RDD is an edge list for a graph, but a user has already 
> partitioned this data by a community structure or connected components, 
> ensuring that similar edges are on the same partition.  Using a join 
> operation to perform message passing will require repartitioning the edge 
> list by the first or second vertex in the edge as a key.
> Instead of repartitioning and shuffling, could messages across partitions be 
> "bundled" together and passed once, almost like a broadcast operation?
> Essentially, the request here is to treat ALL RDDs of any size as 
> broadcast-capable: each partition would be broadcast one at a time 
> and the results aggregated.  It would be up to the user to optimize the 
> partitioning to minimize the between-partition message passing volume.
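
To make the request concrete, here is a minimal sketch in plain Python. It is a simulation only and does not use Spark or any Spark API; the function names `hash_partition` and `partition_broadcast_join` and the sample data are hypothetical, chosen for illustration. The first function mimics key-based repartitioning to show the skew problem. The second keeps the left-hand rows in their existing partitions and passes each right-hand partition around as one bundle, which is one possible reading of the proposal.

```python
from collections import defaultdict


def hash_partition(pairs, num_partitions):
    """Shuffle-style placement: all rows sharing a key land in one partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def partition_broadcast_join(left_parts, right_parts):
    """Inner join that never moves left rows out of their partition.

    Each right partition is treated as one bundle, sent to every left
    partition in turn, and the matches are accumulated per left partition.
    """
    out = [[] for _ in left_parts]
    for bundle in right_parts:  # one simulated broadcast per right partition
        lookup = defaultdict(list)
        for key, value in bundle:
            lookup[key].append(value)
        for i, left in enumerate(left_parts):
            for key, lvalue in left:
                for rvalue in lookup.get(key, ()):
                    out[i].append((key, (lvalue, rvalue)))
    return out


# Skew under key-based repartitioning: key 7 dominates, so one partition
# receives almost every row (small ints hash to themselves in CPython).
skewed = [(7, i) for i in range(8)] + [(1, 0), (2, 0)]
sizes = [len(p) for p in hash_partition(skewed, 4)]  # [0, 1, 1, 8]

# Edge list keyed by source vertex, already partitioned by community.
left_parts = [[(1, 2), (2, 1), (1, 3)], [(4, 5), (5, 4), (4, 1)]]
# Per-vertex messages, in their own existing partitions.
right_parts = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d"), (5, "e")]]
joined = partition_broadcast_join(left_parts, right_parts)
```

In this sketch every right-hand bundle visits every left-hand partition, so the data moved grows with the size of the right side times the number of left partitions. Whether that beats a shuffle depends on how well the user's partitioning keeps matching keys together, which is the trade-off the description leaves to the user.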



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
