[jira] [Commented] (SPARK-9310) Spark shuffle performance degrades significantly with an increased number of tasks

Jem Tucker (JIRA) Fri, 24 Jul 2015 06:31:30 -0700

    [ https://issues.apache.org/jira/browse/SPARK-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640450#comment-14640450 ]


Jem Tucker commented on SPARK-9310:
-----------------------------------

The majority of my operations are joins and aggregations by key 
(groupByKey/reduceByKey). 

It does make sense that shuffle overhead is the cause of the issue I am 
seeing. If I understand correctly, for very large stages this overhead will 
always become a significant proportion of the run time. Is that right? 
Apart from increasing the size of the partitions, which is limited by the 
amount of memory on each node (64 GB on my nodes), is there anything I can 
do to reduce this overhead?
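
To make the scaling concrete, here is a rough sketch of how the number of 
shuffle blocks grows with parallelism. It assumes that both the map side and 
the reduce side of a shuffle use the stated parallelism (2000 or 10000), which 
is an assumption and not something measured on the cluster, and it treats the 
result as an upper bound. The function name is made up for illustration.

```python
def shuffle_block_count(map_tasks: int, reduce_tasks: int) -> int:
    """Upper bound on shuffle blocks: each reduce task may fetch
    one block from every map task."""
    if map_tasks < 0 or reduce_tasks < 0:
        raise ValueError("task counts must be non-negative")
    return map_tasks * reduce_tasks


# Assumption: the same parallelism applies on both sides of the shuffle.
before = shuffle_block_count(2000, 2000)
after = shuffle_block_count(10000, 10000)

print(before)           # 4000000
print(after)            # 100000000
print(after // before)  # 25
```

Under that assumption, a 5x increase in parallelism gives up to 25x as many 
blocks, each of which is correspondingly smaller, so per-block bookkeeping and 
fetch overhead can grow even while disk and network bandwidth stay mostly idle.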

Also, has this been addressed in more recent releases? If so, there may be 
the potential to patch our implementation.


> Spark shuffle performance degrades significantly with an increased number of 
> tasks
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-9310
>                 URL: https://issues.apache.org/jira/browse/SPARK-9310
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>         Environment: 2 node cluster - CDH 5.3.2 on CentOS 
>            Reporter: Jem Tucker
>              Labels: performance
>
> When running a large number of complex stages on high volumes of data, 
> shuffle duration increased by a factor of 3 when the parallelism was 
> increased by a factor of 5, from 2000 to 10000. 
> In both cases tasks run for over a minute (to process approximately 2MB of 
> data with the initial parallelisation), so I ruled out task overhead as the 
> cause.
> Monitoring IO and network traffic showed that neither was at more than 10% 
> of its potential maximum during shuffles, and CPU utilization seemed 
> worryingly low as well. We are not experiencing a concerning level of 
> garbage collection either.
> Is performance of shuffles expected to be so heavily influenced by the number 
> of tasks?  If so, is there an effective way to tune the number of partitions 
> at run-time for different inputs? 
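
One possible way to choose the partition count at run-time is to derive it 
from the input size. The sketch below is only an illustration: the helper 
name, the 128 MiB target partition size, and the clamping bounds are all 
assumptions, not values taken from this issue or from Spark itself.

```python
def choose_num_partitions(
    input_bytes: int,
    target_partition_bytes: int = 128 * 1024 * 1024,  # assumed target size
    min_partitions: int = 1,                          # assumed lower bound
    max_partitions: int = 10000,                      # assumed upper bound
) -> int:
    """Pick a partition count so each partition holds roughly
    target_partition_bytes, clamped to [min_partitions, max_partitions]."""
    if input_bytes < 0:
        raise ValueError("input_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    if min_partitions > max_partitions:
        raise ValueError("min_partitions must not exceed max_partitions")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    wanted = -(-input_bytes // target_partition_bytes)
    return max(min_partitions, min(max_partitions, wanted))


one_gib = 1024 ** 3
print(choose_num_partitions(one_gib))         # 8
print(choose_num_partitions(10 * 1024 ** 4))  # 10000 (clamped)
```

The result could then be passed as the numPartitions argument that 
reduceByKey, groupByKey and join accept on RDDs, so that each job sizes its 
shuffle to its own input.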



