Deependra-Patel opened a new pull request, #42071: URL: https://github.com/apache/spark/pull/42071
This will be available as an external shuffle service metric.

### What changes were proposed in this pull request?
Adds three metrics to `ShuffleMetrics` (exposed by the External Shuffle Service): `totalShuffleDataBytes`, `numAppsWithShuffleData`, and `lastNodeShuffleMetricRefreshEpochMillis`. A new daemon scans the disk every 30 seconds (configurable) and sums up the total size of the shuffle data.

### Why are the changes needed?
Adding these metrics would help in:
1. Deciding whether a node can be decommissioned when no shuffle data is present.
2. Better live monitoring of a customer's workload, to detect skewed shuffle data on the node.

### Does this PR introduce _any_ user-facing change?
This documentation will need to be updated in the next release: https://spark.apache.org/docs/latest/monitoring.html#component-instance--shuffleservice

### How was this patch tested?
Unit tests are added. Also tested manually on a YARN cluster; the metrics are getting published.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
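The periodic disk scan described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; the class name `ShuffleDiskScanner`, the field names, and the method signatures are all hypothetical, chosen only to mirror the metric names mentioned in the description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Hypothetical sketch of a daemon that periodically sums shuffle data on disk
// and records the scan time, in the spirit of totalShuffleDataBytes and
// lastNodeShuffleMetricRefreshEpochMillis.
public class ShuffleDiskScanner {
    private final AtomicLong totalShuffleDataBytes = new AtomicLong();
    private final AtomicLong lastRefreshEpochMillis = new AtomicLong();

    // Sum the sizes of all regular files under the given root directory.
    static long directorySizeBytes(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // Schedule a refresh on a daemon thread, e.g. every 30 seconds.
    void start(Path shuffleDir, long periodSeconds) {
        ScheduledExecutorService scanner =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "shuffle-disk-scanner");
                t.setDaemon(true); // do not block JVM shutdown
                return t;
            });
        scanner.scheduleWithFixedDelay(() -> {
            try {
                totalShuffleDataBytes.set(directorySizeBytes(shuffleDir));
                lastRefreshEpochMillis.set(System.currentTimeMillis());
            } catch (IOException ignored) {
                // Skip this cycle; shuffle files may be deleted mid-scan.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }
}
```

Using `scheduleWithFixedDelay` (rather than a fixed rate) avoids overlapping scans when a walk over a large shuffle directory takes longer than the configured period.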
