[ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:
--------------------------------
    Attachment: openblock.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---------------------------------------------------
>
>                 Key: SPARK-35865
>                 URL: https://issues.apache.org/jira/browse/SPARK-35865
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Baohe Zhang
>            Priority: Major
>         Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the 
> maximum number of threads used for sending responses to chunk fetch requests. 
> However, it causes severe performance degradation because the throughput of 
> handling chunk fetch requests is reduced. SPARK-30623 made the choice between 
> async and sync mode configurable and made async mode the default. 
> SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout 
> issue, and we rarely see SASL timeouts with async mode in our production 
> clusters today. 
> A few days ago we accidentally turned on sync mode on one cluster and 
> observed severe shuffle performance degradation. As a result, we benchmarked 
> async mode against sync mode, and *we suggest removing sync mode from the 
> code base*, as it does not seem to provide any benefit today. We would like 
> to share the benchmark results and hear the community's opinion.
>  
> Benchmark of job run time (sync mode is 2x - 3x slower):
> YARN cluster setup: 6 nodes, 18 executors; each executor has 1 core and 3 GB 
> of memory, and each node manager has a 1 GB heap.
> Shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map tasks 
> and 1000 reduce tasks.
> Results: to shuffle-read 5 GB of data, async mode takes 2-3 minutes and sync 
> mode takes 6 minutes.
>  
> Benchmark of external shuffle service metrics:
> YARN cluster setup: 4 nodes in total. I set 2 nodes to async mode and 2 nodes 
> to sync mode, and shuffled 2.5 GB of data.
> Results: in openblockrequestlatencymillis_ratemean and some other metrics, 
> the nodes in sync mode are 3x - 4x higher than the nodes in async mode. I 
> attached some screenshots of the metrics.
>  
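
The description above attributes the slowdown to handler threads that wait for each response to be sent. A rough back-of-the-envelope model of that effect might look like the sketch below. It is not Spark or Netty code: the function name, the thread count, and the per-request timings are all hypothetical values chosen for illustration, and the model ignores everything except how long a handler thread stays occupied per request.

```python
def max_requests_per_second(threads, handle_ms, flush_ms, await_flush):
    """Upper bound on requests/s for a fixed pool of handler threads.

    await_flush=True models sync mode: the thread stays blocked until the
    response has been written, so it is occupied for handle_ms + flush_ms.
    await_flush=False models async mode: the thread hands the write off and
    is free again after handle_ms.
    """
    busy_ms = handle_ms + (flush_ms if await_flush else 0)
    return threads * 1000 / busy_ms


# Hypothetical numbers for illustration only, not measured values.
THREADS, HANDLE_MS, FLUSH_MS = 8, 1, 3

sync_rps = max_requests_per_second(THREADS, HANDLE_MS, FLUSH_MS, await_flush=True)
async_rps = max_requests_per_second(THREADS, HANDLE_MS, FLUSH_MS, await_flush=False)

print(f"sync:  {sync_rps:.0f} requests/s")
print(f"async: {async_rps:.0f} requests/s")
print(f"async/sync ratio: {async_rps / sync_rps:.1f}x")
```

In this simplified model the gap between the two modes grows with the ratio of flush time to handling time, so the size of the penalty depends on how slow the writes are relative to request handling.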

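The run time figures in the first benchmark can be turned into slowdown factors and average shuffle-read throughput with a few lines of arithmetic. The sketch below uses only the numbers quoted in the report; treating 1 GB as 2**30 bytes and averaging over the whole reported run time are assumptions made here, not statements from the reporter.

```python
# Figures quoted in the report.
SHUFFLE_BYTES = 5 * 1024**3   # "5GB shuffle data" (assumes 1 GB = 2**30 bytes)
RECORDS = 400_000_000         # "400M key-value records"
ASYNC_MINUTES = (2, 3)        # "async mode takes 2-3 mins"
SYNC_MINUTES = 6              # "sync mode takes 6 mins"


def throughput_mib_per_s(total_bytes, minutes):
    """Average throughput in MiB/s over the whole reported run time."""
    return total_bytes / 1024**2 / (minutes * 60)


def slowdown(sync_minutes, async_minutes):
    """How many times longer the sync run takes than the async run."""
    return sync_minutes / async_minutes


bytes_per_record = SHUFFLE_BYTES / RECORDS
slowdowns = [slowdown(SYNC_MINUTES, m) for m in ASYNC_MINUTES]

print(f"average record size: {bytes_per_record:.1f} bytes")
for minutes in ASYNC_MINUTES:
    print(f"async, {minutes} min: {throughput_mib_per_s(SHUFFLE_BYTES, minutes):.1f} MiB/s")
print(f"sync, {SYNC_MINUTES} min: {throughput_mib_per_s(SHUFFLE_BYTES, SYNC_MINUTES):.1f} MiB/s")
print(f"slowdown range: {min(slowdowns):.1f}x to {max(slowdowns):.1f}x")
```

The slowdown range computed this way matches the "2x - 3x slower" summary in the report.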


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

