xkrogen commented on pull request #33446:
URL: https://github.com/apache/spark/pull/33446#issuecomment-887672286


   > We can get quite a few false positives

   I think false positives are okay -- if everything went quickly, you probably 
won't go digging for these logs. And even if things were fast overall, it could 
be useful to see which fetches were the slowest.
   
   > We will need a reasonable number of shuffles on an executor before we can 
even calculate 5%ile. We cannot deduce that a fetch was slow when the fetch 
happened (at least for the first N shuffles), but only at a later stage.
   
   I agree it's an issue that if a slow fetch happens at the beginning, we 
won't have enough info yet to determine that it was slow. I see a few options 
we can consider here:
   - Keep a history of all fetches in memory for the first N fetches (N = 100 
or 1000?). Once N fetches have been collected, build a histogram from them and 
log the slowest 5%. After that, add each new fetch directly to the histogram 
and check the percentile to determine whether logging is needed.
   - Maintain the slowest N fetches (N = 10 or 100?) instead of trying to 
calculate a percentile. When all fetches are complete, just log all N slowest. 
This is simpler and avoids some issues with the above approach, e.g. if fetches 
are getting slower over time, the percentile approach will end up logging all 
of the fetches. Putting some absolute limit on the number may also be better 
-- you're probably not going to manually investigate more than a dozen 
instances of slowness anyway.
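   For illustration, the second option could be sketched with a bounded 
min-heap that keeps only the N slowest durations. This is a rough sketch, not 
Spark's actual shuffle-fetch code; `SlowFetchTracker`, its capacity, and the 
millisecond durations are all hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of "maintain the slowest N fetches" (option 2 above).
// A min-heap of bounded size: the root is the *fastest* of the tracked
// fetches, so it is the first to be evicted when a slower one arrives.
class SlowFetchTracker {
    private final int capacity;
    private final PriorityQueue<Long> slowest = new PriorityQueue<>();

    SlowFetchTracker(int capacity) {
        this.capacity = capacity;
    }

    // Record one completed fetch; O(log N) per call, O(N) memory total.
    void record(long fetchMillis) {
        if (slowest.size() < capacity) {
            slowest.add(fetchMillis);
        } else if (fetchMillis > slowest.peek()) {
            slowest.poll();          // evict the fastest tracked fetch
            slowest.add(fetchMillis);
        }
    }

    // Called once all fetches are complete, e.g. to log the N slowest.
    List<Long> slowestFetches() {
        return new ArrayList<>(slowest);
    }
}
```

   Unlike the percentile approach, the memory and log volume here are bounded 
by N regardless of how the fetch-time distribution drifts over a long-running 
executor.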


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


