[
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092
]
Josh Rosen commented on SPARK-3280:
-----------------------------------
Here are some numbers from August 10. If I recall, this was running on 8
m3.8xlarge nodes. This test linearly scales a bunch of parameters (data set
size, numbers of mappers and reducers, etc). You can see that hash-based
shuffle's performance degrades severely in cases where we have many mappers and
reducers, while sort scales much more gracefully:
!http://i.imgur.com/rODzaG1.png!
!http://i.imgur.com/72kCkH5.png!
This was run with spark-perf; here's a sample config for one of the bars:
{code}
Java options: -Dspark.storage.memoryFraction=0.66
-Dspark.serializer=org.apache.spark.serializer.JavaSerializer
-Dspark.locality.wait=60000000
-Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3
--num-partitions=400 --reduce-tasks=400 --random-seed=5
--persistent-type=memory --num-records=200000000 --unique-keys=20000
--key-length=10 --unique-values=1000000 --value-length=10
--storage-location=hdfs://:9000/spark-perf-kv-data
{code}
I'll try to run a better set of tests today. I plan to look at a few cases
that these tests didn't address, including the performance impact when running
on spinning disks, as well as jobs where we have a large dataset with few
mappers and reducers (I think this is the case that we'd expect to be most
favorable to hash-based shuffle).
> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
> Issue Type: Improvement
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based
> in almost all of our testing.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]