[
https://issues.apache.org/jira/browse/SPARK-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-11387.
----------------------------------
Resolution: Incomplete
> minimize shuffles during joins by using existing partitions and bundling
> messages
> ---------------------------------------------------------------------------------
>
> Key: SPARK-11387
> URL: https://issues.apache.org/jira/browse/SPARK-11387
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Glenn Strycker
> Priority: Major
> Labels: bulk-closed
>
> Currently an RDD join in Spark requires repartitioning by the join key (for
> large RDDs that cannot use broadcast).
> This performs poorly on highly skewed data, because every row containing a
> particular key will end up on one node.
> Additionally, repartitioning is expensive, and the existing partitioning
> scheme may have been optimized to minimize message passing. For example,
> perhaps an RDD is an edge list for a graph, but a user has already
> partitioned this data by a community structure or connected components,
> ensuring that similar edges are on the same partition. Using a join
> operation to perform message passing will require repartitioning the edge
> list by the first or second vertex in the edge as a key.
> Instead of repartitioning and shuffling, could messages across partitions be
> "bundled" together and passed once, almost like a broadcast operation?
> Essentially the request here is to treat ALL RDDs of any size as
> broadcast-capable: each partition would be broadcast one at a time
> and the results aggregated. It would be up to the user to optimize the
> partitioning to minimize the between-partition message passing volume.
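
The idea in the description above can be sketched in plain Python. This is a toy simulation, not Spark code: partitions are modelled as lists of (key, value) pairs, and every function name here (hash_partition, local_join, shuffle_join, bundled_join) is hypothetical and does not correspond to any Spark API. It contrasts a shuffle-style join, which moves every row to the partition chosen by its key, with the proposed approach, which leaves the left side where the user placed it and ships each right-side partition to it one at a time.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Shuffle-style placement: all rows sharing a key land in one partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Inner hash join of two in-memory partitions on the key."""
    index = defaultdict(list)
    for key, value in right_part:
        index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left_part for rv in index.get(k, [])]


def shuffle_join(left_rows, right_rows, num_partitions):
    """Repartition both sides by key, then join matching partitions."""
    left_parts = hash_partition(left_rows, num_partitions)
    right_parts = hash_partition(right_rows, num_partitions)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(local_join(lp, rp))
    return out


def bundled_join(left_parts, right_parts):
    """Leave the left partitions in place. Each right partition is sent
    ("broadcast") as one bundle, one at a time, to every left partition,
    joined locally, and the partial results are aggregated."""
    out = []
    for rp in right_parts:          # one bundle per right partition
        for lp in left_parts:       # left rows never move
            out.extend(local_join(lp, rp))
    return out


# Hypothetical data: the left side keeps a user-chosen partitioning in
# which the skewed key 1 is spread over both partitions.
left_parts = [[(1, "a"), (1, "b"), (2, "c")], [(1, "d"), (3, "e")]]
right_parts = [[(1, "x")], [(3, "y"), (4, "z")]]

flat_left = [row for part in left_parts for row in part]
flat_right = [row for part in right_parts for row in part]

# Both strategies produce the same joined rows.
assert sorted(bundled_join(left_parts, right_parts)) == sorted(
    shuffle_join(flat_left, flat_right, 2)
)
```

In this sketch every right partition visits every left partition, so the transferred volume grows with the number of left partitions. A less naive version would send a bundle only to partitions that hold matching keys, which is where a user-optimized partitioning would pay off.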
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)