Steven Rand created SPARK-27773:
-----------------------------------
Summary: Add shuffle service metric for number of exceptions
caught in TransportChannelHandler
Key: SPARK-27773
URL: https://issues.apache.org/jira/browse/SPARK-27773
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 2.4.3
Reporter: Steven Rand
The health of the external shuffle service is currently difficult to monitor.
At least for the YARN shuffle service, the only indication of health is
whether or not the shuffle service threads are running in the NodeManager.
However, we've seen that clients can sometimes experience elevated failure
rates on requests to the shuffle service even when those threads are running.
It would be helpful to have some indication of how often requests to the
shuffle service are failing, as this could be monitored, alerted on, etc.
One suggestion (implemented in the PR I'll attach to this ticket) is to add a
metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} that tracks how many
times {{TransportChannelHandler#exceptionCaught}} is invoked. I think this
gives us the insight into request failure rates that we're currently missing,
but I'm open to alternatives if people have other ideas.
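
To make the proposal concrete, here is a rough sketch of the idea in plain Java. It is not the code from the PR: {{ShuffleMetricsSketch}}, {{ChannelHandlerSketch}}, {{ShuffleMetricsDemo}} and {{markExceptionCaught}} are hypothetical names, and the counter is a JDK {{LongAdder}} standing in for whatever metric type the shuffle service would actually register. The real {{exceptionCaught}} is a Netty callback that also receives a {{ChannelHandlerContext}}, which is left out here so the example runs without dependencies.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for ExternalShuffleBlockHandler.ShuffleMetrics.
final class ShuffleMetricsSketch {
    // LongAdder keeps increments cheap when many I/O threads report failures.
    private final LongAdder exceptionCount = new LongAdder();

    void markExceptionCaught() {
        exceptionCount.increment();
    }

    long caughtExceptions() {
        return exceptionCount.sum();
    }
}

// Hypothetical stand-in for TransportChannelHandler (assumption: simplified
// signature without Netty's ChannelHandlerContext).
final class ChannelHandlerSketch {
    private final ShuffleMetricsSketch metrics;

    ChannelHandlerSketch(ShuffleMetricsSketch metrics) {
        this.metrics = metrics;
    }

    void exceptionCaught(Throwable cause) {
        // Count every exception that reaches the handler; a real handler
        // would also log the cause and close the channel.
        metrics.markExceptionCaught();
    }
}

final class ShuffleMetricsDemo {
    public static void main(String[] args) {
        ShuffleMetricsSketch metrics = new ShuffleMetricsSketch();
        ChannelHandlerSketch handler = new ChannelHandlerSketch(metrics);

        // Simulate two failed requests.
        handler.exceptionCaught(new java.io.IOException("connection reset"));
        handler.exceptionCaught(new RuntimeException("decode failure"));

        System.out.println("caught exceptions: " + metrics.caughtExceptions());
    }
}
```

A monitoring system could then poll the counter and alert on its rate of change, which is the failure-rate signal described above.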