Jem Tucker created SPARK-9310:
---------------------------------
Summary: Spark shuffle performance degrades significantly with an
increased number of tasks
Key: SPARK-9310
URL: https://issues.apache.org/jira/browse/SPARK-9310
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.2.0
Environment: 2 node cluster - CDH 5.3.2 on CentOS
Reporter: Jem Tucker
When running a large number of complex stages on high volumes of data shuffle
duration increased by a factor of 3 when the parallelism was increased by a
factor of 5 from 2000 to 10000.
In both cases tasks run for over a minute (to process approximately 2MB of data
with initial parallelisation) so I ruled out any task overhead that could be
causing this.
Monitoring IO and network traffic showed that neither were at more than 10% of
their potential max during shuffles and CPU utilization seemed worryingly low
as well, neither are we experiencing a concerning level of garbage collection.
Is performance of shuffles expected to be so heavily influenced by the number
of tasks? If so, is there an effective way to tune the number of partitions at
run-time for different inputs?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]