[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608578#comment-14608578
 ] 

David Sabater commented on SPARK-4644:
--------------------------------------

I think the point here is to minimise the shuffling produced by the skewed key, 
i.e. one executor receives all tuples based on the skewed key.
As the groupBy is not using information about partitioning and ii's known to be 
not optimal anyway, it makes sense to focus on improving this.

The point is how other MPP engines are handling skewed data? As an example 
Greenplum just partitions based on your distribution key and leverages storage 
rather than memory to proces the queries. Kind of brute force approach, 
broadcasting the other tables assumed smaller and doing the sort-merge join 
locally. I believe this is the way GPDB is doing it but worth checking this to 
see if helps so solve this challenge.

> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise a OutOfMemory error, if some book has too many users 
> which cannot be fit into memory. To solve it, we propose a skewed join 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to