[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043829#comment-17043829 ]

Min Shen commented on SPARK-30602:
----------------------------------

[~shanyu], I have listed a few key differences between Riffle and this approach 
in my comment 
(http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-tt28732.html#a28734)
 on the dev mailing list. They are summarized below:

1. The merge ratio in Riffle might not be high enough, depending on the average 
# of mapper tasks per node.
2. Riffle does not deliver the merged shuffle partition data to the reducers, 
so data locality for reducer tasks is not improved. Most of the reducer task 
input still needs to be fetched remotely, incurring RPC overhead and potential 
connection establishment failures.
3. More importantly, as illustrated in the Riffle paper, the local merge is 
performed by the shuffle service, since it needs to read multiple mappers' 
output. This means the memory buffering of shuffle blocks to improve disk I/O 
happens on the shuffle service side. While our approach also does memory 
buffering, we do it on the executor side, which makes it much less constrained 
than doing it inside the shuffle service (see the sketch after this list). 
This helps improve the scalability of the solution, since the shuffle service 
is a shared service in most cluster setups we know of so far.
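
To make point 3 concrete, below is a minimal, hypothetical sketch (in Scala) of 
executor-side buffering: small shuffle blocks accumulate in an executor-local 
buffer and are pushed out as larger merged chunks. The names (PushBuffer, 
pushToRemote, maxBytes) are illustrative only and are not actual Spark or 
prototype APIs.

// Hypothetical sketch only -- not the actual SPIP implementation or a Spark API.
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Accumulates small shuffle blocks in executor memory and pushes them out as
// larger merged chunks, so the buffering pressure stays on the executor side
// rather than inside the shared shuffle service.
class PushBuffer(maxBytes: Long, pushToRemote: Seq[ByteBuffer] => Unit) {
  private val pending = ArrayBuffer.empty[ByteBuffer]
  private var bufferedBytes = 0L

  // Called by the mapper as it produces shuffle blocks.
  def add(block: ByteBuffer): Unit = {
    pending += block
    bufferedBytes += block.remaining()
    if (bufferedBytes >= maxBytes) flush()
  }

  // One push carries many small blocks, amortizing RPC and connection overhead.
  def flush(): Unit = {
    if (pending.nonEmpty) {
      pushToRemote(pending.toSeq)
      pending.clear()
      bufferedBytes = 0L
    }
  }
}

A mapper would call add(...) per block and flush() at the end of the map task; 
the actual proposal performs the push over Spark's extended Netty shuffle 
protocol, as described in the SPIP below.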

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --------------------------------------------------------------
>
>                 Key: SPARK-30602
>                 URL: https://issues.apache.org/jira/browse/SPARK-30602
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.1.0
>            Reporter: Min Shen
>            Priority: Major
>
> In a large deployment of Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable 
> the Spark external shuffle service and store the intermediate shuffle files 
> on HDD. Because the number of blocks generated for a particular shuffle 
> grows quadratically with the size of the shuffled data (# mappers and # 
> reducers grow linearly with the size of the shuffled data, but # blocks is 
> # mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD 
> for small amounts of data, the overall efficiency of the Spark external 
> shuffle services serving the shuffle blocks degrades as an increasing # of 
> Spark applications process an increasing amount of data. In addition, 
> because the Spark external shuffle service is a shared service in a 
> multi-tenancy cluster, the inefficiency of one Spark application could 
> propagate to other applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, the shuffle is performed at the end of the mappers, and blocks get 
> pre-merged and moved towards the reducers. In our prototype implementation, 
> we have seen significant efficiency improvements when performing large 
> shuffles. We take a Spark-native approach to achieve this, i.e., extending 
> Spark’s existing Netty-based shuffle protocol and the behaviors of Spark 
> mappers, reducers, and drivers. This way, we can bring the benefits of more 
> efficient shuffle to Spark without incurring the dependency on, or overhead 
> of, either a specialized storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
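
To make the block-count arithmetic in the quoted description concrete, here is 
a small illustration in Scala with hypothetical numbers (not taken from the 
ticket): # blocks is # mappers * # reducers, so the average block size shrinks 
into the 10s of KBs even for a multi-TB shuffle.

// Hypothetical numbers, for illustration only.
val shuffleBytes = 4L * 1024 * 1024 * 1024 * 1024   // 4 TiB of total shuffle data
val mappers      = 10000
val reducers     = 10000
val blocks       = mappers.toLong * reducers        // 100,000,000 blocks
val avgBlockKiB  = shuffleBytes.toDouble / blocks / 1024
println(f"$blocks%d blocks, ~$avgBlockKiB%.0f KiB per block on average")  // ~43 KiB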


