[
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228839#comment-14228839
]
Aaron Davidson commented on SPARK-4644:
---------------------------------------
[~zsxwing] I believe that this problem is related more fundamentally to the
problem that Spark currently requires that all values for the same key remain
in memory. Your solution aims to fix this for the specific case of joins, but I
wonder if we generalize it, if we could solve this for things like groupBy as
well.
I don't have a fully fleshed out idea yet, but I was considering a model where
there are 2 types of shuffles: aggregation-based and rearrangement-based.
Aggregation-based shuffles use partial aggregation and combiners to form and
merge (K, C) pairs. Rearrangement-based shuffles do not expect a decrease in
the amount of total data, however, and so my thought is that this model does
not make sense.
Instead, we could provide an interface similar to ExternalAppendOnlyMap but
which returns an Iterator[(K, Iterable[V])] pairs, with some extra semantics
related to the Iterable[V]s (such as having a .chunkedIterator() method which
enables block nested loops join).
In this model, join could be implemented by mapping the left side's key to (K,
1) and the right side to (K, 2) and having logic which reads from two adjacent
value-iterables simultaneously -- e.g.,
val ((k, 1), left: Iterable[V]) = map.next()
val ((k, 2), right: Iterable[V]) = map.next()
// perform merge using the left and right iterators.
> Implement skewed join
> ---------------------
>
> Key: SPARK-4644
> URL: https://issues.apache.org/jira/browse/SPARK-4644
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Shixiong Zhu
> Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have
> several books which are liked by most of the users. Running ALS on such
> skewed data will raise a OutOfMemory error, if some book has too many users
> which cannot be fit into memory. To solve it, we propose a skewed join
> implementation.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]