Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @squito
    Thanks a lot for taking the time to look into this PR.
    I updated the PR. Currently it just adds two metrics: a) the total
underestimated size across all blocks, and b) the size of blocks shuffled to memory.
    For a), the executor uses `maxBytesInFlight` to control the speed of the
shuffle read. I agree with your comment `another metric that may be nice to
capture here is maximum underestimate`. But consider this scenario: the maximum
underestimate is small, yet thousands of blocks are underestimated, so
`maxBytesInFlight` cannot help avoid an OOM during the shuffle read. That's why
I proposed tracking the total underestimated size across all blocks (see the
sketch below);
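    To illustrate (a hypothetical sketch, not code from this PR; the block
sizes below are made up), the total underestimate can be large even when the
per-block maximum looks harmless:

    ```scala
    // Hypothetical numbers: each of 10000 remote blocks is reported as 64 KB
    // but actually arrives as 96 KB.
    val estimated = Array.fill(10000)(64L * 1024)
    val actual    = Array.fill(10000)(96L * 1024)

    val underestimates = estimated.zip(actual).map { case (est, act) => math.max(0L, act - est) }
    println(s"max underestimate   = ${underestimates.max} bytes")  // 32 KB per block, looks harmless
    println(s"total underestimate = ${underestimates.sum} bytes")  // ~312 MB in total
    ```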
    For b), currently all data is shuffle-read into memory. If we add the
feature of shuffling to disk when memory is short, we need to evaluate the
performance. I think two more metrics need to be taken into account: the size
of blocks shuffled to disk (to be added in another PR) and the task's running
time (which already exists). The more data shuffled to memory, the better the
performance; the shorter the time cost, the better the performance.
    
    I also added some debug logging in `ShuffleWriter`, including the number of
underestimated blocks and the size distribution.
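    For reference, the kind of write-side logging I mean would look roughly
like the sketch below; the method name and bucket boundaries are hypothetical,
not the actual code in this PR:

    ```scala
    // Illustrative only: count underestimated blocks and report a rough
    // cumulative distribution of how far off the estimates were.
    def logUnderestimates(estimatedSizes: Seq[Long], actualSizes: Seq[Long]): Unit = {
      val diffs = estimatedSizes.zip(actualSizes)
        .collect { case (est, act) if act > est => act - est }
      val buckets = Seq("<=1KB" -> (1L << 10), "<=1MB" -> (1L << 20), "<=1GB" -> (1L << 30))
      val distribution = buckets.map { case (label, bound) => label -> diffs.count(_ <= bound) }
      println(s"underestimated blocks: ${diffs.size}, size distribution: $distribution")
    }
    ```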

