xkrogen opened a new pull request #32388:
URL: https://github.com/apache/spark/pull/32388


   ### What changes were proposed in this pull request?
   This adds two new metrics to `ExternalBlockHandler`:
   - `blockTransferRate` -- the rate at which blocks are transferred, as 
opposed to the rate of the data within them
   - `blockTransferAvgSize_1min` -- a 1-minute trailing average of the size of 
blocks transferred by the ESS
   
   Additionally, this enhances `YarnShuffleServiceMetrics` to expose the 
histogram/`Snapshot` information from `Timer` metrics within 
`ExternalBlockHandler`.
   
   ### Why are the changes needed?
   Currently `ExternalBlockHandler` exposes some useful metrics, but it lacks 
metrics for the rate of block transfers. We have `blockTransferRateBytes` to 
tell us the rate of _bytes_, but no metric to tell us the rate of _blocks_, 
which is especially relevant when running the ESS on HDDs that are sensitive 
to random reads. Many small block transfers can hurt performance, but won't 
show up as a spike in `blockTransferRateBytes` since the sizes are small. The 
new metrics for average block size and block transfer rate are therefore 
useful for monitoring the health and performance of the ESS, especially when 
running on HDDs.
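
   To illustrate why a byte rate alone can hide this pattern, here is a minimal 
arithmetic sketch. It is not the code in this PR: the class and method names 
(`BlockRateExample`, `bytesPerSecond`, and so on) and the two workloads are 
hypothetical, and it uses a plain average over a fixed window rather than the 
trailing averages the real metrics would report.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of ExternalBlockHandler.
public class BlockRateExample {
    /** Total bytes across all blocks transferred in the window. */
    public static long totalBytes(long[] blockSizes) {
        return Arrays.stream(blockSizes).sum();
    }

    /** Byte rate: what a bytes-based meter would reflect. */
    public static double bytesPerSecond(long[] blockSizes, double windowSeconds) {
        return totalBytes(blockSizes) / windowSeconds;
    }

    /** Block rate: number of block transfers per second. */
    public static double blocksPerSecond(long[] blockSizes, double windowSeconds) {
        return blockSizes.length / windowSeconds;
    }

    /** Plain average block size over the window (0 if nothing was transferred). */
    public static double averageBlockSize(long[] blockSizes) {
        return blockSizes.length == 0
            ? 0.0
            : (double) totalBytes(blockSizes) / blockSizes.length;
    }

    public static void main(String[] args) {
        double windowSeconds = 60.0;

        // Assumed workload A: 60 blocks of 64 MiB each.
        long[] fewLarge = new long[60];
        Arrays.fill(fewLarge, 64L * 1024 * 1024);

        // Assumed workload B: 61440 blocks of 64 KiB each (same total bytes).
        long[] manySmall = new long[61440];
        Arrays.fill(manySmall, 64L * 1024);

        for (long[] workload : new long[][] {fewLarge, manySmall}) {
            System.out.printf(
                "bytes/s=%.0f blocks/s=%.0f avgBlockSize=%.0f%n",
                bytesPerSecond(workload, windowSeconds),
                blocksPerSecond(workload, windowSeconds),
                averageBlockSize(workload));
        }
    }
}
```

   Both workloads move the same number of bytes per second, so a byte-rate 
metric cannot tell them apart, while the block rate and average block size 
differ by three orders of magnitude.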
   
   For `YarnShuffleServiceMetrics`, the three `Timer` metrics exposed by 
`ExternalBlockHandler` are currently underutilized in a YARN-based 
environment -- they are effectively treated as a `Meter`, exposing only 
rate-based information, even though the metrics themselves collect detailed 
histograms of timing information. We should expose this information for better 
observability.
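
   As a rough illustration of what is lost when a timer is reported only as a 
rate, the sketch below compares a rate-only view with a histogram-style view 
of the same samples. It is a simplified stand-in written against the Java 
standard library, not the metrics library's `Timer`/`Snapshot` implementation 
and not the code in this PR. The class name, the sample latencies, and the 
nearest-rank percentile method are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical illustration only; a real Timer samples into a reservoir
// and may compute percentiles differently.
public class TimerViewsExample {
    /** Rate-only view: events per second, ignoring how long each one took. */
    public static double ratePerSecond(long[] latenciesMillis, double windowSeconds) {
        return latenciesMillis.length / windowSeconds;
    }

    /** Histogram-style view: nearest-rank percentile for q in (0, 1]. */
    public static long percentile(long[] latenciesMillis, double q) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double windowSeconds = 10.0;

        // Assumed samples: same count and same mean, very different tails.
        long[] steady = {10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
        long[] tailHeavy = {2, 2, 2, 2, 2, 2, 2, 2, 2, 82};

        for (long[] samples : new long[][] {steady, tailHeavy}) {
            System.out.printf(
                "rate/s=%.1f median=%dms p99=%dms%n",
                ratePerSecond(samples, windowSeconds),
                percentile(samples, 0.5),
                percentile(samples, 0.99));
        }
    }
}
```

   The two sample sets are indistinguishable by rate, but the median and 99th 
percentile separate them immediately, which is the kind of detail the 
histogram information makes visible.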
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, there are two entirely new metrics for the ESS, as documented in 
`monitoring.md`. Additionally, in a YARN environment, `Timer` metrics exposed 
by the ESS will include richer timing information.
   
   ### How was this patch tested?
   New unit tests verify that the new metrics show up as expected.
   
   We have been running this patch internally for approx. 1 year and have found 
it to be useful for monitoring the health of ESS and diagnosing performance 
issues.

