Cheng Pei created SPARK-14293:
---------------------------------
Summary: Improve shuffle load balancing and minmize network data
transmission
Key: SPARK-14293
URL: https://issues.apache.org/jira/browse/SPARK-14293
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.6.1
Reporter: Cheng Pei
Fix For: 2.0.0
Currently Spark provides the mechanism to set preferred location for reduce
task. When the fraction of total map output at a location is equal or greater
than the parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a
preferred location. But this does not consider load balancing and network
transmission. Based on the map output sizes and locations tracked by
MapOutputTracker, we can obtain a better load balancing, which has been tested
in the some real applications, such as wordcount and pagerank.
This patch proposes a strategy to set preferred locations for each reduce task,
which could firstly keep each executor process almost the same amount of
intermediate data and secondly minimize the network data transmission. This can
benefit some conditions:
1. REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the
largest output. If there exists data skew in the map outputs. It could cause
some executors that have large of map outputs become busy. Our method could
avoid this case and minimize the network data transmission.
2. When there are large of reduce tasks in the job, it helps each executor
processes almost the same data and keeps load balancing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]