dzypersonal commented on PR #36162: URL: https://github.com/apache/spark/pull/36162#issuecomment-1685522302
> It helps in two cases @weixiuli - the example you gave (generated input (like range()), etc., where there are no input metrics). It also helps when reading shuffle input where there is a sort - the entire shuffle input will get consumed at the beginning of the task, but the output rate would be impacted by the subsequent computation/skew/etc. in the task (or even output writes from the stage).

That makes sense. I have a data-skew task as follows:

The median shuffle-read records process rate is probably 25507 / 5 = 5101.4, and the median shuffle-write records process rate is probably 399365 / 5 = 79873. The skewed task at index 42 is marked as speculatable because its shuffle-read records process rate is probably 15606048 / 5400 = 2890, which seems inefficient. But if we calculate its shuffle-write records process rate, it is probably 499186709 / 5400 = 92441.983, which is larger than the median one, 79873.
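The arithmetic above can be sketched as follows. This is a minimal illustration of the comparison being discussed, not Spark's actual speculation logic; the `process_rate` helper and the hard-coded metric values are taken from the numbers in this comment.

```python
def process_rate(records: int, seconds: float) -> float:
    """Records processed per second over the task's run time."""
    return records / seconds

# Median task: 5 s run time.
median_read_rate = process_rate(25507, 5)        # 5101.4 records/s
median_write_rate = process_rate(399365, 5)      # 79873.0 records/s

# Skewed task (index 42): 5400 s run time.
skew_read_rate = process_rate(15606048, 5400)    # ~2890 records/s
skew_write_rate = process_rate(499186709, 5400)  # ~92441.98 records/s

# By shuffle-read rate the skewed task looks slow, so it would be
# marked speculatable; by shuffle-write rate it is actually faster
# than the median task.
print(skew_read_rate < median_read_rate)    # True
print(skew_write_rate > median_write_rate)  # True
```

This is the crux of the concern: which side of the shuffle you measure flips the conclusion about whether the task is a good speculation candidate.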
