dzypersonal commented on PR #36162:
URL: https://github.com/apache/spark/pull/36162#issuecomment-1685522302

   > It helps in two cases @weixiuli - the example you gave (generated input 
like range(), etc., where there are no input metrics). It also helps when 
reading shuffle input where there is a sort - the entire shuffle input will be 
consumed at the beginning of the task, but the output rate will be impacted by 
the subsequent computation/skew/etc. in the task (or even output writes from 
the stage).
   
   That makes sense. I got a data skew task as follows:
   
![WeChat Work screenshot_80ae77cb-4647-49ce-a3e5-6c6c78104d09](https://github.com/apache/spark/assets/39691337/309028a1-1e33-404a-80b0-186a8aafc5b1)
   
![WeChat Work screenshot_91a84b56-66c3-4a52-8bdd-bbc779a743a0](https://github.com/apache/spark/assets/39691337/a0325a5f-eab3-4e85-a415-7b69ece9528e)
   
   The median task's shuffle read record process rate is roughly 25507 / 5 ≈ 
5101.4 records/s, and its shuffle write record process rate is roughly 
399365 / 5 = 79873 records/s.
   The skewed task (index 42) is marked as speculatable because its shuffle 
read record process rate is roughly 15606048 / 5400 ≈ 2890 records/s, which 
looks inefficient. But if we calculate its shuffle write record process rate 
instead, it is roughly 499186709 / 5400 ≈ 92442 records/s, which is larger 
than the median's 79873.
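
   The arithmetic above can be sketched as follows. The numbers come from the 
screenshots in this comment; `process_rate` is just an illustrative helper, 
not Spark's actual speculation logic:

```python
# Sketch of the read-rate vs. write-rate comparison for the skewed task.
# All record counts and runtimes are taken from the screenshots above;
# process_rate() is a hypothetical helper, not a Spark API.

def process_rate(records, runtime_secs):
    """Records processed per second over the task's runtime."""
    return records / runtime_secs

# Median task: 5 s runtime
median_read_rate = process_rate(25507, 5)         # ~5101.4 records/s
median_write_rate = process_rate(399365, 5)       # 79873.0 records/s

# Skewed task (index 42): 5400 s runtime
skew_read_rate = process_rate(15606048, 5400)     # ~2890.0 records/s
skew_write_rate = process_rate(499186709, 5400)   # ~92442.0 records/s

# Judged by shuffle read rate, the skewed task looks slow and would be
# marked speculatable ...
assert skew_read_rate < median_read_rate
# ... but judged by shuffle write rate it is actually faster than the
# median task, so speculating on it is questionable.
assert skew_write_rate > median_write_rate
```

   This is why read-rate-only speculation can misfire on tasks whose output 
rate is healthy.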


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]