[ https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27773:
------------------------------------
Assignee: (was: Apache Spark)
> Add shuffle service metric for number of exceptions caught in
> TransportChannelHandler
> -------------------------------------------------------------------------------------
>
> Key: SPARK-27773
> URL: https://issues.apache.org/jira/browse/SPARK-27773
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.3
> Reporter: Steven Rand
> Priority: Minor
>
> The health of the external shuffle service is currently difficult to monitor.
> At least for the YARN shuffle service, the only current indication of health
> is whether or not the shuffle service threads are running in the NodeManager.
> However, we've seen that clients can sometimes experience elevated failure
> rates on requests to the shuffle service even when those threads are running.
> It would be helpful to have some indication of how often requests to the
> shuffle service are failing, as this could be monitored, alerted on, etc.
> One suggestion (implemented in the PR I'll attach to this ticket) is to add a
> metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} that counts how many
> times {{TransportChannelHandler#exceptionCaught}} is called. I think this
> gives us the insight into request failure rates that we're currently missing,
> but I'm open to alternatives if people have other ideas.
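
For illustration only, the idea in the description can be sketched as a counter that is incremented each time the handler's exception callback runs. This is a minimal, standard-library-only sketch and not the code from the PR: the class names ExceptionCountSketch, ShuffleMetricsSketch and ChannelHandlerSketch, and the methods markExceptionCaught and caughtExceptionCount, are hypothetical stand-ins. The real {{TransportChannelHandler}} is a Netty handler, and the real {{ShuffleMetrics}} may expose its metrics differently.

```java
import java.io.IOException;
import java.util.concurrent.atomic.LongAdder;

public class ExceptionCountSketch {

    /** Hypothetical stand-in for a shuffle service metrics holder. */
    static final class ShuffleMetricsSketch {
        // LongAdder keeps increments cheap when many I/O threads report at once.
        private final LongAdder caughtExceptions = new LongAdder();

        void markExceptionCaught() {
            caughtExceptions.increment();
        }

        long caughtExceptionCount() {
            return caughtExceptions.sum();
        }
    }

    /** Hypothetical stand-in for a channel handler with an exception callback. */
    static final class ChannelHandlerSketch {
        private final ShuffleMetricsSketch metrics;

        ChannelHandlerSketch(ShuffleMetricsSketch metrics) {
            this.metrics = metrics;
        }

        // Assumed shape of the callback: count the failure first, then do
        // whatever cleanup a real handler would do (logging, closing the channel).
        void exceptionCaught(Throwable cause) {
            metrics.markExceptionCaught();
        }
    }

    public static void main(String[] args) {
        ShuffleMetricsSketch metrics = new ShuffleMetricsSketch();
        ChannelHandlerSketch handler = new ChannelHandlerSketch(metrics);

        // Simulate two failed requests reaching the handler.
        handler.exceptionCaught(new IOException("connection reset"));
        handler.exceptionCaught(new IllegalStateException("bad frame"));

        System.out.println("exceptions caught: " + metrics.caughtExceptionCount());
    }
}
```

A monotonically increasing count like this is what a monitoring system would typically sample and turn into a rate for alerting. How the value is actually registered and exported is left open here.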