[
https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397223#comment-15397223
]
Jurriaan Pruis edited comment on SPARK-16753 at 7/28/16 8:02 AM:
-----------------------------------------------------------------
[~rxin]
I've set the following options:
{code}
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 1500M
{code}
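(For reference, the programmatic equivalent of those options would be roughly the
following sketch, assuming a Spark 2.0 SparkSession-based setup.)
{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Same off-heap settings as above, set on a SparkConf instead of spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1500m")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}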
Still getting errors like:
{code}
org.apache.spark.shuffle.FetchFailedException: Too large frame: 2690834287
{code}
{code}
java.lang.OutOfMemoryError: Unable to acquire 400 bytes of memory, got 0
{code}
And
{code}
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container marked as failed: container_1469605743211_0916_01_000052 on
host: ip-172-31-33-133.eu-central-1.compute.internal. Exit status: 52.
Diagnostics: Exception from container-launch.
{code}
The task that hit the OutOfMemoryError had:
{code}
Memory spill: 1808.1 MB
Shuffle Read MB/Records: 656.9 MB / 5040711
{code}
There were tasks with far more memory spill that were processed just fine (see the
attached screenshot).
> Spark SQL doesn't handle skewed dataset joins properly
> ------------------------------------------------------
>
> Key: SPARK-16753
> URL: https://issues.apache.org/jira/browse/SPARK-16753
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Reporter: Jurriaan Pruis
> Attachments: screenshot-1.png
>
>
> I'm having issues joining a 1 billion row dataframe containing skewed data with
> multiple other dataframes ranging in size from 100,000 to 10 million rows.
> This means some of the joins (about half of them) can be done as broadcast
> joins, but not all of them.
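> For the smaller dataframes, a broadcast join can be forced with an explicit hint.
> A minimal sketch of what those joins look like (dataframe names, columns and data
> are made up for illustration, assuming Spark 2.0):
> {code}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.broadcast
>
> val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()
> import spark.implicits._
>
> // Tiny stand-ins for the real tables.
> val bigDf   = Seq(("k1", "a"), ("k2", "b"), ("k1", "c")).toDF("join_key", "payload")
> val smallDf = Seq(("k1", "x"), ("k2", "y")).toDF("join_key", "dim")
>
> // broadcast() ships smallDf to every executor, so bigDf doesn't have to be shuffled.
> val joined = bigDf.join(broadcast(smallDf), Seq("join_key"), "left_outer")
> joined.show()
> {code}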
> Because the data in the large dataframe is skewed, we get out of memory errors
> in the executors or errors like
> {{org.apache.spark.shuffle.FetchFailedException: Too large frame}}.
> We tried a lot of things, like broadcast joining the skewed rows separately and
> unioning them with the dataset containing the sort merge joined data. This works
> perfectly when doing one or two joins, but when doing 10 joins like this the
> query planner gets confused (see [SPARK-15326]).
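> Roughly, that approach looks like this (a minimal sketch reusing the made-up
> bigDf/smallDf names from above; the skewed keys would come from a frequency
> count beforehand, and NULL keys would need separate handling):
> {code}
> import org.apache.spark.sql.functions.{broadcast, col}
>
> // Hypothetical set of non-null keys known to be heavily skewed.
> val skewedKeys = Seq("k1")
>
> val skewedPart  = bigDf.filter(col("join_key").isin(skewedKeys: _*))
> val regularPart = bigDf.filter(!col("join_key").isin(skewedKeys: _*))
>
> // Broadcast join the skewed slice, sort merge join the rest, then glue them back together.
> val joinedSkewed  = skewedPart.join(broadcast(smallDf), Seq("join_key"), "left_outer")
> val joinedRegular = regularPart.join(smallDf, Seq("join_key"), "left_outer")
>
> val result = joinedSkewed.union(joinedRegular)  // unionAll on Spark 1.6
> {code}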
> As most of the rows are skewed on the NULL value, we use a hack where we put
> unique values in those NULL columns so the data is properly distributed over
> all partitions. This works fine for NULL values, but since this table is
> growing rapidly and we have skewed data for non-NULL values as well, this
> isn't a full solution to the problem.
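> The NULL "salting" hack is roughly this (a sketch, assuming Spark 2.0 and a
> string join key; names are made up):
> {code}
> import org.apache.spark.sql.functions.{coalesce, col, concat, lit, monotonically_increasing_id}
>
> // Replace NULL join keys with unique placeholder values so they no longer all
> // hash to the same partition. NULL keys never match in a join anyway, so the
> // join result for those rows is unchanged.
> val saltedBigDf = bigDf.withColumn(
>   "join_key",
>   coalesce(col("join_key"),
>            concat(lit("__null__"), monotonically_increasing_id().cast("string"))))
> {code}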
> Right now this specific Spark job runs successfully about 30% of the time, and
> it's getting worse because of the increasing amount of data.
> How should these kinds of joins be approached in Spark? It seems strange that I
> can't find proper solutions to this problem, or other people running into the
> same kind of issues, given that Spark profiles itself as a large-scale data
> processing engine. Joining big datasets should be something Spark handles out
> of the box.