[ 
https://issues.apache.org/jira/browse/SPARK-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640414#comment-14640414
 ] 

Sean Owen commented on SPARK-9310:
----------------------------------

What are the essential operations you're running? That might help, for example, 
in identifying how to optimize the operation.

But if you have 5x more tasks, you have 25x more potential task-to-task 
communications to complete the shuffle. That doesn't mean 25x more I/O, but it 
does mean more overhead to manage the shuffle. It's not crazy that it might 
make the shuffle time 3x longer.
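The scaling argument above can be sketched in a few lines of plain Python (this is an illustration of the arithmetic, not Spark code): with an all-to-all shuffle, each map task may need to produce a block for each reduce task, so the number of potential transfers grows with the product of the two task counts.

```python
# Illustrative sketch: in an all-to-all shuffle, every map task may send a
# block to every reduce task, so potential task-to-task transfers scale as
# map_tasks * reduce_tasks.

def potential_shuffle_transfers(map_tasks: int, reduce_tasks: int) -> int:
    """Upper bound on map->reduce block transfers in one shuffle."""
    return map_tasks * reduce_tasks

base = potential_shuffle_transfers(2000, 2000)      # parallelism of 2000
scaled = potential_shuffle_transfers(10000, 10000)  # parallelism of 10000

# 5x the tasks on both sides of the shuffle means 25x the potential pairs.
print(scaled // base)  # 25
```

Each transfer is smaller, so total bytes moved stay roughly constant, but the per-transfer bookkeeping (connections, block lookups, scheduling) multiplies.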

> Spark shuffle performance degrades significantly with an increased number of 
> tasks
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-9310
>                 URL: https://issues.apache.org/jira/browse/SPARK-9310
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>         Environment: 2 node cluster - CDH 5.3.2 on CentOS 
>            Reporter: Jem Tucker
>              Labels: performance
>
> When running a large number of complex stages on high volumes of data, 
> shuffle duration increased by a factor of 3 when the parallelism was 
> increased by a factor of 5, from 2000 to 10000. 
> In both cases tasks run for over a minute (to process approximately 2MB of 
> data at the initial parallelism), so I ruled out any per-task overhead as 
> the cause.
> Monitoring IO and network traffic showed that neither was at more than 10% 
> of its potential maximum during shuffles, CPU utilization seemed worryingly 
> low as well, and we are not experiencing a concerning level of garbage 
> collection.
> Is performance of shuffles expected to be so heavily influenced by the number 
> of tasks?  If so, is there an effective way to tune the number of partitions 
> at run-time for different inputs? 
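On the question of tuning the partition count at run time, one common approach is to derive it from the input size rather than hard-coding it. The sketch below is a hedged illustration of that idea; the helper name and the 128 MB target are assumptions, not anything Spark provides.

```python
# Hypothetical helper: pick a partition count from the input size at run
# time, aiming for a fixed amount of data per partition. The name, the
# 128 MB target, and the floor of 200 are illustrative assumptions.

def choose_num_partitions(total_input_bytes: int,
                          target_partition_bytes: int = 128 * 1024 * 1024,
                          min_partitions: int = 200) -> int:
    """Aim for roughly target_partition_bytes per partition, with a floor."""
    needed = -(-total_input_bytes // target_partition_bytes)  # ceiling division
    return max(min_partitions, needed)

# e.g. a 50 GB input at ~128 MB per partition:
print(choose_num_partitions(50 * 1024**3))  # 400
```

A value computed this way could then be passed to operations like repartition(n), or used to set spark.sql.shuffle.partitions, instead of a single static number for all inputs.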



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
