[
https://issues.apache.org/jira/browse/SPARK-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640486#comment-14640486
]
Sean Owen commented on SPARK-9310:
----------------------------------
Shuffling is frequently the slowest part of a computation. A lot of small
partitions may make it worse, but it depends. Do you know if you have enabled
the sort-based shuffle? I think that was *not* the default in CDH 5.3, but I
forget ([~sandyr] and or [~vanzin] would know). If not, that could help.
Yes, plenty has changed since 1.2 that would improve the shuffle. I can't say
for sure whether it would change what you see. In your setup if you can try CDH
5.4 for example, that would get you 1.3.x. If you're able to build Spark and do
a bit of setup, you can likely get 1.4 running on 5.3 too.
Another approach is to try to design to avoid shuffles. Or at least, shuffle
less data. I see that I/O isn't the bottleneck, but it may still help. Or try
to design to not need so many small partitions in the shuffle stage?
> Spark shuffle performance degrades significantly with an increased number of
> tasks
> ----------------------------------------------------------------------------------
>
> Key: SPARK-9310
> URL: https://issues.apache.org/jira/browse/SPARK-9310
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 1.2.0
> Environment: 2 node cluster - CDH 5.3.2 on CentOS
> Reporter: Jem Tucker
> Labels: performance
>
> When running a large number of complex stages on high volumes of data shuffle
> duration increased by a factor of 3 when the parallelism was increased by a
> factor of 5 from 2000 to 10000.
> In both cases tasks run for over a minute (to process approximately 2MB of
> data with initial parallelisation) so I ruled out any task overhead that
> could be causing this.
> Monitoring IO and network traffic showed that neither were at more than 10%
> of their potential max during shuffles and CPU utilization seemed worryingly
> low as well, neither are we experiencing a concerning level of garbage
> collection.
> Is performance of shuffles expected to be so heavily influenced by the number
> of tasks? If so, is there an effective way to tune the number of partitions
> at run-time for different inputs?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]