Jem Tucker created SPARK-9310:
---------------------------------

             Summary: Spark shuffle performance degrades significantly with an 
increased number of tasks
                 Key: SPARK-9310
                 URL: https://issues.apache.org/jira/browse/SPARK-9310
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.2.0
         Environment: 2 node cluster - CDH 5.3.2 on CentOS 
            Reporter: Jem Tucker


When running a large number of complex stages on high volumes of data shuffle 
duration increased by a factor of 3 when the parallelism was increased by a 
factor of 5 from 2000 to 10000. 

In both cases tasks run for over a minute (to process approximately 2MB of data 
with initial parallelisation) so I ruled out any task overhead that could be 
causing this.

Monitoring IO and network traffic showed that neither were at more than 10% of 
their potential max during shuffles and CPU utilization seemed worryingly low 
as well, neither are we experiencing a concerning level of garbage collection.

Is performance of shuffles expected to be so heavily influenced by the number 
of tasks?  If so, is there an effective way to tune the number of partitions at 
run-time for different inputs? 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to