xkrogen commented on pull request #33446: URL: https://github.com/apache/spark/pull/33446#issuecomment-887672286
> We can get quite a few false positives

I think false positives are okay -- if everything went quickly, you probably won't go digging for these logs. Even if things were fast overall, it could be useful to see which fetches were the slowest.

> We will need a reasonable number of shuffles on an executor before we can even calculate 5%ile. We cannot deduce that a fetch was slow when the fetch happened (at least for the first N shuffles), but only at a later stage.

I agree it's an issue that if the slow fetch happened at the beginning, we wouldn't have enough info to determine whether it was slow. I see a few options we can consider here:

- Keep a history of all fetches in memory for the first N fetches (N = 100 or 1000?). Once N is reached, build a histogram from them and log the slowest 5%. After that, add each new fetch directly to the histogram and check the percentile to decide whether logging is needed.
- Maintain the slowest N fetches (N = 10 or 100?), instead of trying to calculate a percentile. When all fetches are complete, just log all N. This is simpler and avoids some issues with the approach above -- e.g. if fetches are getting slower over time, the percentile approach will end up logging all of the fetches. Putting an absolute limit on the number might be better -- you're probably not going to manually investigate more than a dozen instances of slowness anyway.
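The second option (track only the N slowest fetches) can be sketched with a bounded min-heap. This is a hypothetical illustration, not Spark's actual implementation -- the class and method names are made up for the example; the min-heap's root is always the fastest of the tracked fetches, so it is the one evicted when a slower fetch arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Hypothetical sketch (not Spark's real API): keep only the N slowest
 * shuffle-fetch durations, so at the end we can log just the worst offenders
 * without ever computing a percentile.
 */
class SlowFetchTracker {
    private final int capacity;
    // Min-heap ordered by duration: the root is the *fastest* tracked fetch,
    // so it is the first to be evicted when capacity is exceeded.
    private final PriorityQueue<Long> slowest = new PriorityQueue<>();

    SlowFetchTracker(int capacity) {
        this.capacity = capacity;
    }

    /** Record a completed fetch; retains only the `capacity` largest durations. */
    void record(long durationMs) {
        slowest.offer(durationMs);
        if (slowest.size() > capacity) {
            slowest.poll(); // drop the fastest of the tracked fetches
        }
    }

    /** Durations to log once all fetches are complete (unordered). */
    List<Long> slowestFetches() {
        return new ArrayList<>(slowest);
    }
}
```

Each `record` call is O(log N) with O(N) memory for a fixed N, and unlike the percentile approach it needs no warm-up history, which sidesteps the "first N shuffles" problem discussed above.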
