Cheng Pei created SPARK-14293:
---------------------------------

             Summary: Improve shuffle load balancing and minmize network data 
transmission
                 Key: SPARK-14293
                 URL: https://issues.apache.org/jira/browse/SPARK-14293
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.6.1
            Reporter: Cheng Pei
             Fix For: 2.0.0


Currently Spark provides the mechanism to set preferred location for reduce 
task. When the fraction of total map output at a location is equal or greater 
than the  parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a 
preferred location. But this does not consider load balancing and network 
transmission. Based on the map output sizes and locations tracked by 
MapOutputTracker, we can obtain a better load balancing, which has been tested 
in the some real applications, such as wordcount and pagerank.
 
This patch proposes a strategy to set preferred locations for each reduce task, 
which could firstly keep each executor process almost the same amount of 
intermediate data and secondly minimize the network data transmission. This can 
benefit some conditions:
1.      REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the 
largest output. If there exists data skew in the map outputs. It could cause 
some executors that have large of map outputs become busy. Our method could 
avoid this case and minimize the network data transmission. 
2.      When there are large of reduce tasks in the job, it helps each executor 
processes almost the same data and keeps load balancing.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to