[ https://issues.apache.org/jira/browse/SPARK-14293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14293:
------------------------------------

    Assignee:     (was: Apache Spark)

> Improve shuffle load balancing and minimize network data transmission
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14293
>                 URL: https://issues.apache.org/jira/browse/SPARK-14293
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Cheng Pei
>              Labels: performance
>
> Currently Spark provides a mechanism to set preferred locations for reduce 
> tasks: when the fraction of the total map output at a location is equal to 
> or greater than the parameter REDUCER_PREF_LOCS_FRACTION, the reduce task 
> gets a preferred location. But this does not take load balancing or network 
> transmission into account. Based on the map output sizes and locations 
> tracked by MapOutputTracker, we can obtain better load balancing.
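> For context, the existing rule amounts to the check sketched below (a 
> minimal illustration, not Spark's actual code; the names mapOutputSizes 
> and fractionThreshold are ours, and 0.2 mirrors the default value of 
> REDUCER_PREF_LOCS_FRACTION):
> {code:scala}
> // A host becomes a preferred location for a reduce task only if it
> // already holds at least `fractionThreshold` of that task's total
> // map output.
> def preferredLocations(
>     mapOutputSizes: Map[String, Long],  // host -> bytes for this reducer
>     fractionThreshold: Double = 0.2     // cf. REDUCER_PREF_LOCS_FRACTION
> ): Seq[String] = {
>   val total = mapOutputSizes.values.sum.toDouble
>   if (total <= 0) Seq.empty
>   else mapOutputSizes.collect {
>     case (host, bytes) if bytes / total >= fractionThreshold => host
>   }.toSeq
> }
> {code}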
>  
> This patch proposes a strategy for setting preferred locations for each 
> reduce task that, first, keeps the amount of intermediate data processed 
> by each executor almost equal and, second, minimizes network data 
> transmission (a sketch follows below). This benefits two cases:
> 1. REDUCER_PREF_LOCS_FRACTION tries to place a reduce task close to its 
> largest map output. If the map outputs are skewed, the executors holding 
> large map outputs can become hotspots. Our method avoids this and 
> minimizes network data transmission.
> 2. When a job has many reduce tasks, it helps each executor process 
> almost the same amount of data, keeping the load balanced.
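> As an illustration only (this greedy assignment is our sketch of the idea 
> under simplifying assumptions, not the patch itself), the strategy can be 
> realized by placing the largest reduce tasks first on the host that 
> already holds most of their input, while capping each host's assigned 
> bytes near the per-host average:
> {code:scala}
> import scala.collection.mutable
> 
> // Greedy assignment: prefer the host holding most of a reducer's map
> // output (minimizing network transfer), but keep every host's total
> // assigned bytes near the per-host average (load balancing).
> def assignReducers(
>     reducers: Seq[Int],
>     bytesByHost: Int => Map[String, Long],  // reducer -> (host -> bytes)
>     hosts: Seq[String]
> ): Map[Int, String] = {
>   val load = mutable.Map(hosts.map(_ -> 0L): _*)
>   val sizeOf = reducers.map(r => r -> bytesByHost(r).values.sum).toMap
>   val target = sizeOf.values.sum / math.max(hosts.size, 1)
> 
>   // Place the largest reducers first, so they get the best local choice.
>   reducers.sortBy(r => -sizeOf(r)).map { r =>
>     val sizes = bytesByHost(r)
>     // Consider only hosts still under the per-host budget; fall back to
>     // all hosts if the budget would exclude everything.
>     val underBudget = hosts.filter(h => load(h) + sizeOf(r) <= target)
>     val pool = if (underBudget.nonEmpty) underBudget else hosts
>     // Maximize local bytes; break ties toward the least-loaded host.
>     val chosen = pool.maxBy(h => (sizes.getOrElse(h, 0L), -load(h)))
>     load(chosen) += sizeOf(r)
>     r -> chosen
>   }.toMap
> }
> {code}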


