[ 
https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27773:
--------------------------------------

    Assignee: Steven Rand

> Add shuffle service metric for number of exceptions caught in 
> ExternalShuffleBlockHandler
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-27773
>                 URL: https://issues.apache.org/jira/browse/SPARK-27773
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.4.3
>            Reporter: Steven Rand
>            Assignee: Steven Rand
>            Priority: Minor
>
> The health of the external shuffle service is currently difficult to monitor. 
> At least for the YARN shuffle service, the only current indication of health 
> is whether or not the shuffle service threads are running in the NodeManager. 
> However, we've seen that clients can sometimes experience elevated failure 
> rates on requests to the shuffle service even when those threads are running. 
> It would be helpful to have some indication of how often requests to the 
> shuffle service are failing, as this could be monitored, alerted on, etc.
> One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
> metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} that tracks how many 
> times an exception was caught in the shuffle service's RPC handler. I think 
> this gives us the insight into request failure rates that we're currently 
> missing, but I'm open to alternatives if people have other ideas.
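
A minimal sketch of the idea described above, for illustration only. It does not use Spark's classes or any metrics library: ShuffleMetrics and BlockHandler below are hypothetical stand-ins, the method names markCaughtException and caughtExceptions are made up for this example, and a plain LongAdder stands in for whatever counter type the actual PR uses. The point is only the shape: the RPC entry point wraps request handling in a try/catch, bumps a counter on failure, and rethrows so the caller still sees the error.

```java
import java.util.concurrent.atomic.LongAdder;

public class ShuffleMetricsSketch {

    /** Hypothetical stand-in for ExternalShuffleBlockHandler.ShuffleMetrics. */
    static final class ShuffleMetrics {
        // Counts exceptions caught in the RPC handler (assumed metric, name invented here).
        private final LongAdder caughtExceptions = new LongAdder();

        void markCaughtException() {
            caughtExceptions.increment();
        }

        long caughtExceptions() {
            return caughtExceptions.sum();
        }
    }

    /** Hypothetical stand-in for the shuffle service's RPC handler. */
    static final class BlockHandler {
        private final ShuffleMetrics metrics;

        BlockHandler(ShuffleMetrics metrics) {
            this.metrics = metrics;
        }

        /** Entry point for a request: record any failure, then rethrow it unchanged. */
        String receive(String request) {
            try {
                return handle(request);
            } catch (RuntimeException e) {
                metrics.markCaughtException();
                throw e;
            }
        }

        // Toy request handling: an empty or null request is treated as a failure.
        private String handle(String request) {
            if (request == null || request.isEmpty()) {
                throw new IllegalArgumentException("empty request");
            }
            return "ok:" + request;
        }
    }

    public static void main(String[] args) {
        ShuffleMetrics metrics = new ShuffleMetrics();
        BlockHandler handler = new BlockHandler(metrics);

        handler.receive("block-1");          // succeeds, counter unchanged
        try {
            handler.receive("");             // fails, counter incremented
        } catch (IllegalArgumentException expected) {
            // the client still sees the failure; the metric only observes it
        }

        // prints: caught exceptions: 1
        System.out.println("caught exceptions: " + metrics.caughtExceptions());
    }
}
```

A monitoring system could then poll this counter and alert on its rate of change, which is the kind of signal the description says is missing when only thread liveness is visible.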



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

