[
https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397223#comment-15397223
]
Jurriaan Pruis edited comment on SPARK-16753 at 7/28/16 8:02 AM:
-----------------------------------------------------------------
[~rxin]
I've set the following options:
{code}
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 1500M
{code}
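(For reference, the programmatic equivalent of those options would be roughly the
following sketch, assuming a Spark 2.0 SparkSession-based setup.)
{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Same off-heap settings as above, set on a SparkConf instead of spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1500m")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}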
Still getting errors like:
{code}
org.apache.spark.shuffle.FetchFailedException: Too large frame: 2690834287
{code}
{code}
java.lang.OutOfMemoryError: Unable to acquire 400 bytes of memory, got 0
{code}
And
{code}
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container marked as failed: container_1469605743211_0916_01_000052 on
host: ip-172-31-33-133.eu-central-1.compute.internal. Exit status: 52.
Diagnostics: Exception from container-launch.
{code}
The task that hit the OutOfMemoryError had:
{code}
Memory spill: 1808.1 MB
Shuffle Read MB/Records: 656.9 MB / 5040711
{code}
There were tasks with far more memory spill that were processed just fine (see the
attached screenshot).
> Spark SQL doesn't handle skewed dataset joins properly
> ------------------------------------------------------
>
> Key: SPARK-16753
> URL: https://issues.apache.org/jira/browse/SPARK-16753
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Reporter: Jurriaan Pruis
> Attachments: screenshot-1.png
>
>
> I'm having issues joining a 1 billion row dataframe containing skewed data with
> multiple other dataframes ranging in size from 100,000 to 10 million rows.
> This means some of the joins (about half of them) can be done as broadcast
> joins, but not all of them.
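> For the smaller dataframes, a broadcast join can be forced with an explicit hint.
> A minimal sketch of what those joins look like (dataframe names, columns and data
> are made up for illustration, assuming Spark 2.0):
> {code}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.broadcast
>
> val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()
> import spark.implicits._
>
> // Tiny stand-ins for the real tables.
> val bigDf   = Seq(("k1", "a"), ("k2", "b"), ("k1", "c")).toDF("join_key", "payload")
> val smallDf = Seq(("k1", "x"), ("k2", "y")).toDF("join_key", "dim")
>
> // broadcast() ships smallDf to every executor, so bigDf doesn't have to be shuffled.
> val joined = bigDf.join(broadcast(smallDf), Seq("join_key"), "left_outer")
> joined.show()
> {code}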
> Because the data in the large dataframe is skewed, we get out of memory errors
> in the executors or errors like
> {{org.apache.spark.shuffle.FetchFailedException: Too large frame}}.
> We tried a lot of things, like broadcast joining the skewed rows separately and
> unioning them with the dataset containing the sort merge joined data. This works
> perfectly when doing one or two joins, but when doing 10 joins like this the
> query planner gets confused (see [SPARK-15326]).
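> Roughly, that approach looks like this (a minimal sketch reusing the made-up
> bigDf/smallDf names from above; the skewed keys would come from a frequency
> count beforehand, and NULL keys would need separate handling):
> {code}
> import org.apache.spark.sql.functions.{broadcast, col}
>
> // Hypothetical set of non-null keys known to be heavily skewed.
> val skewedKeys = Seq("k1")
>
> val skewedPart  = bigDf.filter(col("join_key").isin(skewedKeys: _*))
> val regularPart = bigDf.filter(!col("join_key").isin(skewedKeys: _*))
>
> // Broadcast join the skewed slice, sort merge join the rest, then glue them back together.
> val joinedSkewed  = skewedPart.join(broadcast(smallDf), Seq("join_key"), "left_outer")
> val joinedRegular = regularPart.join(smallDf, Seq("join_key"), "left_outer")
>
> val result = joinedSkewed.union(joinedRegular)  // unionAll on Spark 1.6
> {code}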
> As most of the rows are skewed on the NULL value, we use a hack where we put
> unique values in those NULL columns so the data is properly distributed over
> all partitions. This works fine for NULL values, but since this table is
> growing rapidly and we have skewed data for non-NULL values as well, this
> isn't a full solution to the problem.
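> The NULL "salting" hack is roughly this (a sketch, assuming Spark 2.0 and a
> string join key; names are made up):
> {code}
> import org.apache.spark.sql.functions.{coalesce, col, concat, lit, monotonically_increasing_id}
>
> // Replace NULL join keys with unique placeholder values so they no longer all
> // hash to the same partition. NULL keys never match in a join anyway, so the
> // join result for those rows is unchanged.
> val saltedBigDf = bigDf.withColumn(
>   "join_key",
>   coalesce(col("join_key"),
>            concat(lit("__null__"), monotonically_increasing_id().cast("string"))))
> {code}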
> Right now this specific Spark job runs successfully about 30% of the time, and
> it's getting worse because of the increasing amount of data.
> How should these kinds of joins be approached in Spark? It seems strange that I
> can't find proper solutions to this problem, or other people running into the
> same kind of issues, given that Spark profiles itself as a large-scale data
> processing engine. Joining big datasets should be something Spark handles out
> of the box.