Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/17276
@squito
Thanks a lot for taking the time to look into this PR.
I updated the PR. Currently it just adds two metrics: a) the total size of underestimated blocks, and b) the size of blocks shuffled to memory.
For a), the executor uses `maxBytesInFlight` to control the speed of shuffle-read. I agree with your comment that `another metric that may be nice to capture here is maximum underestimate`. But consider this scenario: the maximum underestimate is small, but thousands of blocks are underestimated, so `maxBytesInFlight` cannot help avoid an OOM during shuffle-read. That's why I proposed tracking the total size of the underestimated blocks.
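To make that concrete, here is a toy, self-contained sketch (not Spark's actual code; the block count and sizes are made up) showing how the total underestimate can be large even when the per-block maximum is small:

```scala
// Standalone illustration, not Spark internals: with thousands of slightly
// underestimated blocks, the *total* underestimate dwarfs the *maximum* one,
// and maxBytesInFlight (which schedules fetches by estimated size) cannot
// bound the real bytes held in memory.
object UnderestimateSketch {
  case class BlockSizes(estimated: Long, actual: Long)

  def main(args: Array[String]): Unit = {
    // 5000 blocks, each underestimated by only 32 KB (illustrative numbers).
    val blocks = Seq.fill(5000)(BlockSizes(estimated = 64L * 1024, actual = 96L * 1024))

    val underestimates = blocks.map(b => math.max(0L, b.actual - b.estimated))
    println(s"max underestimate   = ${underestimates.max} bytes")   // 32 KB
    println(s"total underestimate = ${underestimates.sum} bytes")   // ~160 MB
  }
}
```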
For b), currently all data is shuffle-read into memory. If we add the feature of shuffling to disk when memory is short, we need to evaluate the performance impact. I think two more metrics need to be taken into account: the size of blocks shuffled to disk (to be added in another PR) and the task's running time (which already exists). The more data shuffled to memory, the better the performance; the shorter the time cost, the better the performance.
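As a rough illustration of the bookkeeping I have in mind (field and method names here are hypothetical, not the real `ShuffleReadMetrics` API), the two counters could look like:

```scala
// Hypothetical counters, for illustration only; not Spark's actual metrics class.
class ShuffleReadSideMetrics {
  private var _remoteBytesToMemory: Long = 0L // b) added in this PR
  private var _remoteBytesToDisk: Long = 0L   // to be added in a follow-up PR

  def incRemoteBytesToMemory(n: Long): Unit = _remoteBytesToMemory += n
  def incRemoteBytesToDisk(n: Long): Unit = _remoteBytesToDisk += n

  def remoteBytesToMemory: Long = _remoteBytesToMemory
  def remoteBytesToDisk: Long = _remoteBytesToDisk
}
```

Comparing these two counters together with the existing task running time should give a fair picture of the trade-off.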
I also added some debug logging in `ShuffleWriter`, including the number of underestimated blocks and their size distribution.
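For reference, the kind of distribution logging I mean is roughly the following (a hedged sketch; the helper name and log format are made up, not the code in this PR):

```scala
// Illustrative helper, not the actual PR code: log a rough size distribution
// of the block sizes reported by the map task.
object SizeDistributionLogging {
  def logSizeDistribution(sizes: Array[Long]): Unit = {
    if (sizes.nonEmpty) {
      val sorted = sizes.sorted
      def pct(p: Double): Long = sorted(((sorted.length - 1) * p).toInt)
      println(s"block sizes: count=${sorted.length}, min=${sorted.head}, " +
        s"p50=${pct(0.5)}, p90=${pct(0.9)}, max=${sorted.last}")
    }
  }
}
```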