[GitHub] [spark] cloud-fan commented on issue #22173: [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests.

GitBox Thu, 09 Jan 2020 01:00:22 -0800

cloud-fan commented on issue #22173: [SPARK-24355] Spark external shuffle 
server improvement to better handle block fetch requests.
URL: https://github.com/apache/spark/pull/22173#issuecomment-572459561
 
 
   Unfortunately, I wasn't able to reduce our internal workload to a minimal 
reproduction, so I switched to TPCDS to show the perf regression.
   
   data: TPCDS table `store_sales` with scale factor 99. It's 3.5GB, 1233 files
   query: `sql("select count(distinct ss_list_price) from store_sales where 
ss_quantity == 5").show`
   spark: latest master, "local-cluster[2, 4, 19968]"
   env: m4-4xlarge
   
   Since reverting this commit would involve too many changes, I simply removed 
the `await` in `ChunkFetchRequestHandler`, which effectively reverts this feature.
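
   To make the blocking-versus-non-blocking distinction concrete, here is a 
minimal sketch using only `java.util.concurrent`. It is an analogy, not Spark's 
or Netty's actual code: `AwaitSketch`, `writeAndFlush`, and `handleAll` are 
hypothetical names, and the simulated write delay is an assumption. It only 
shows that a handler thread which waits for each write to finish serializes 
its requests, while one that does not wait lets the writes overlap. It does not 
model Spark's thread pools or disk I/O and does not explain the measured numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in; NOT Spark's ChunkFetchRequestHandler or Netty's Channel API.
final class AwaitSketch {

    // Simulates an asynchronous write that completes on another thread after delayMs.
    static CompletableFuture<Void> writeAndFlush(ScheduledExecutorService io, long delayMs) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        io.schedule(() -> done.complete(null), delayMs, TimeUnit.MILLISECONDS);
        return done;
    }

    // Issues `requests` simulated writes from the calling thread and returns the
    // elapsed milliseconds until all of them have completed.
    static long handleAll(int requests, long delayMs, boolean await) {
        ScheduledExecutorService io = Executors.newScheduledThreadPool(4);
        try {
            long start = System.nanoTime();
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                CompletableFuture<Void> write = writeAndFlush(io, delayMs);
                if (await) {
                    // Block the handler thread until this write has completed.
                    write.join();
                }
                pending.add(write);
            }
            // Wait for any writes that are still in flight.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            io.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("with await:    " + handleAll(8, 50, true) + " ms");
        System.out.println("without await: " + handleAll(8, 50, false) + " ms");
    }
}
```

   With `await` set to `true`, the total time grows with the number of requests 
because each write must finish before the next one starts. With `false`, the 
simulated writes overlap and the total stays close to a single write delay.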
   
   With `await` removed, the query runs only 4% faster end to end, which is not 
much. However, the task metrics in the web UI show that shuffle read time is 
significantly lower without `await`.
   
   The master branch:
   
![image](https://user-images.githubusercontent.com/3182036/72052885-f0ba9e00-3300-11ea-94c5-38ba5316218c.png)
   and the second stage
   
![image](https://user-images.githubusercontent.com/3182036/72052947-0cbe3f80-3301-11ea-8b0b-e512b6b94bb1.png)
   
   With `await` removed:
   
![image](https://user-images.githubusercontent.com/3182036/72052981-1e074c00-3301-11ea-8171-274cbec017b2.png)
   and the second stage
   
![image](https://user-images.githubusercontent.com/3182036/72053024-31b2b280-3301-11ea-844e-476cc9e15d2a.png)
   
   The shuffle read is about 3x faster with `await` removed.


