[ https://issues.apache.org/jira/browse/SPARK-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640556#comment-14640556 ]

Jem Tucker commented on SPARK-9310:
-----------------------------------

I will try with sort-based shuffle enabled. With regard to designing shuffles 
out of the code, I have been trying to minimise them, but I will keep looking 
for alternative approaches. I have also been looking at ways of accurately 
partitioning my RDDs on the fly, so that may be the key to tuning these 
shuffles.

Is it worth suggesting that the issues discussed here regarding task counts 
and shuffle performance be mentioned in the 'Tuning Spark' documentation? A 
more detailed section on shuffle tuning would be useful.

Thanks for the suggestions, I will let you know how they perform.
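
As a minimal sketch of the "partitioning on the fly" idea mentioned above: 
derive the shuffle partition count from an estimated input size instead of 
hard-coding 2000 or 10000. The object name, the 128 MB per-task target, and 
the bounds are all assumptions for illustration, not values from this issue 
or Spark defaults.

```scala
// Hypothetical helper: pick a shuffle partition count at run-time from an
// estimate of the input size, clamped to avoid the extremes reported here
// (too few partitions vs. 10000+ tiny tasks). All constants are illustrative.
object PartitionSizing {
  def numPartitions(inputBytes: Long,
                    targetBytesPerTask: Long = 128L * 1024 * 1024,
                    min: Int = 200,
                    max: Int = 2000): Int = {
    val raw = math.ceil(inputBytes.toDouble / targetBytesPerTask).toInt
    math.min(max, math.max(min, raw))
  }
}
```

It could then be passed explicitly to a shuffle operation, e.g. 
`rdd.reduceByKey(_ + _, PartitionSizing.numPartitions(estimatedBytes))`, 
where `estimatedBytes` comes from whatever size estimate is available for 
the input.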

> Spark shuffle performance degrades significantly with an increased number of 
> tasks
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-9310
>                 URL: https://issues.apache.org/jira/browse/SPARK-9310
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>         Environment: 2 node cluster - CDH 5.3.2 on CentOS 
>            Reporter: Jem Tucker
>              Labels: performance
>
> When running a large number of complex stages on high volumes of data, 
> shuffle duration increased by a factor of 3 when the parallelism was 
> increased by a factor of 5, from 2000 to 10000. 
> In both cases tasks ran for over a minute (processing approximately 2MB of 
> data at the initial parallelisation), so I ruled out per-task overhead as 
> the cause.
> Monitoring IO and network traffic showed that neither was at more than 10% 
> of its potential maximum during shuffles, CPU utilisation seemed worryingly 
> low as well, and we are not experiencing a concerning level of garbage 
> collection.
> Is shuffle performance expected to be so heavily influenced by the number 
> of tasks?  If so, is there an effective way to tune the number of partitions 
> at run-time for different inputs? 
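
For reference, the two knobs discussed in this thread can be set without code 
changes. `spark.shuffle.manager` and `spark.default.parallelism` are real 
Spark 1.2-era settings; the values, class name, and jar below are illustrative 
placeholders only.

```shell
# A minimal sketch: enable sort-based shuffle and set the default shuffle
# parallelism from the command line. Values are examples, not recommendations.
spark-submit \
  --conf spark.shuffle.manager=sort \
  --conf spark.default.parallelism=2000 \
  --class com.example.MyJob myjob.jar
```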



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
