Baohe Zhang created SPARK-35865:
-----------------------------------

             Summary: Remove await (syncMode) in ChunkFetchRequestHandler
                 Key: SPARK-35865
                 URL: https://issues.apache.org/jira/browse/SPARK-35865
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 3.1.2, 2.4.8
            Reporter: Baohe Zhang
         Attachments: openblock-compare.png, openblock.png

SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the 
maximum number of threads used to send responses to chunk fetch requests. 
However, it causes severe performance degradation because it reduces the 
throughput of handling chunk fetch requests. SPARK-30623 made the async and 
sync modes configurable and made async mode the default. 
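
To illustrate why awaiting each response limits throughput, here is a minimal 
sketch. It is not Spark's or Netty's code: the class name, the writeChunk 
helper, the thread-pool size and the sleep-based "write" are all assumptions 
made up for this example. It only contrasts a handler that blocks on every 
write (sync) with one that issues writes without waiting (async).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Spark.
public class SyncVsAsyncSketch {

    // Stand-in for writing one chunk response to a channel: the "write"
    // is simulated by sleeping on an I/O thread for writeMillis.
    static CompletableFuture<Void> writeChunk(ExecutorService io, long writeMillis) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(writeMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, io);
    }

    // Sync style: the handler thread awaits each write before it picks up
    // the next request, so the writes are serialized.
    static long runSync(int requests, long writeMillis, ExecutorService io) {
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            writeChunk(io, writeMillis).join(); // blocks the handler thread
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Async style: the handler thread issues every write without waiting;
    // here we only wait at the end in order to measure total elapsed time.
    static long runAsync(int requests, long writeMillis, ExecutorService io) {
        long start = System.nanoTime();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            pending.add(writeChunk(io, writeMillis));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newFixedThreadPool(8);
        try {
            long syncMillis = runSync(8, 50, io);
            long asyncMillis = runAsync(8, 50, io);
            System.out.println("sync:  " + syncMillis + " ms");
            System.out.println("async: " + asyncMillis + " ms");
        } finally {
            io.shutdown();
        }
    }
}
```

In this toy setup the sync variant needs at least the sum of all write times, 
while the async variant can overlap them. The real trade-off in the shuffle 
service also involves event loop threads and SASL handshakes, which this 
sketch does not model.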

SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout 
issue, and we rarely see SASL timeouts with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and observed 
severe shuffle performance degradation. As a result, we benchmarked async mode 
against sync mode, and *we suggest removing sync mode from the code base*, as 
it no longer seems to provide any benefit. We would like to share the 
benchmark results and hear the community's opinion.

 

Benchmark on job run time (sync mode is 2x - 3x slower):
YARN cluster setup: 6 nodes, 18 executors; each executor has 1 core and 3 GB of 
memory, and each node manager has a 1 GB heap.

Shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map tasks 
and 1000 reduce tasks.

Results: reading 5 GB of shuffle data takes 2-3 minutes in async mode and 
6 minutes in sync mode.

 

Benchmark on external shuffle service metrics:
YARN cluster setup: 4 nodes in total. I set 2 nodes to async mode and 2 nodes 
to sync mode, then shuffled 2.5 GB of data.

Results: for openblockrequestlatencymillis_ratemean and some other metrics, the 
values on nodes in sync mode are 3x - 4x higher than on nodes in async mode. I 
attached some screenshots of the metrics.

 


