[
https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152674#comment-17152674
]
George George commented on SPARK-31635:
---------------------------------------
Hello [~Chen Zhang],
Thanks a lot for getting back on this.
I would agree with you that it is an improvement. However, because it fails
when using the DataFrame API and there is no documentation covering this
behavior, I considered it a bug.
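For context on the numbers, here is a rough back-of-the-envelope check (my own arithmetic, not from the report) of why the 100 MB cap trips in the example below: each Point carries two Doubles, so even the raw payload dwarfs the configured limit before any serialization overhead.
{code:java}
// Payload-only estimate for the reproduction below; serialization and
// object overhead would only make the real footprint larger.
val bytesPerPoint = 2 * 8                    // two Doubles per Point
val bytesPerRow   = 250000L * bytesPerPoint  // one Nested row: ~4 MB
val totalBytes    = 100 * bytesPerRow        // 100 rows: ~400 MB
println(totalBytes / (1024 * 1024))          // prints 381 (MiB)
{code}
So each task's output is already around 4 MB, and any handful of sorted partitions pulled back to the driver together quickly passes 100 MB.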
Your suggestion sounds good to me, and I think it's worth giving the user the
opportunity to configure this. The user could then decide whether to wait a
little longer for the result or to put more pressure on the driver.
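Until such a setting exists, the existing limit can already be raised when launching the shell; a sketch mirroring the invocation from the report (the 2g value is just an illustration, and per the Spark configuration docs, 0 disables the check entirely, at the risk of out-of-memory errors on the driver):
{code:java}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=2g"
{code}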
I could also try to submit a PR, but I would need some more time for it. Just
let me know whether you would rather wait for my PR or do it yourself.
Best,
George
> Spark SQL Sort fails when sorting big data points
> -------------------------------------------------
>
> Key: SPARK-31635
> URL: https://issues.apache.org/jira/browse/SPARK-31635
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: George George
> Priority: Major
>
> Please have a look at the example below:
> {code:java}
> case class Point(x: Double, y: Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a =>
>   Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
> test.toDF().as[Nested].sort("a").take(1)
> {code}
> *Sorting* big data objects using the *Spark DataFrame* API fails with the
> following exception:
> {code:java}
> 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized
> results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize
> (100.0 MB)
> [Stage 0:======> (12 + 3) /
> 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total
> size of serialized results of 13 tasks (100.1 MB) is bigger than
> spark.driver.maxResu
> {code}
> However, the same job using the *RDD API* works and no exception is thrown:
> {code:java}
> case class Point(x: Double, y: Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a =>
>   Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
> test.sortBy(_.a).take(1)
> {code}
> For both code snippets we started the spark shell with exactly the same
> arguments:
> {code:java}
> spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
> {code}
> Even if we increase spark.driver.maxResultSize, the executors still get
> killed for our use case. Interestingly, when using the RDD API directly the
> problem does not occur. *Could there be a bug in the DataFrame sort that
> shuffles too much data to the driver?*
> Note: this is a small example and I reduced spark.driver.maxResultSize to a
> smaller value, but in our real application I tried setting it to 8GB and, as
> mentioned above, the job was still killed.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)