Steven Rand created SPARK-27773:
-----------------------------------

             Summary: Add shuffle service metric for number of exceptions caught in TransportChannelHandler
                 Key: SPARK-27773
                 URL: https://issues.apache.org/jira/browse/SPARK-27773
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.4.3
            Reporter: Steven Rand


The health of the external shuffle service is currently difficult to monitor. 
At least for the YARN shuffle service, the only current indication of health is 
whether or not the shuffle service threads are running in the NodeManager. 
However, we've seen that clients can sometimes experience elevated failure 
rates on requests to the shuffle service even when those threads are running. 
It would be helpful to have some indication of how often requests to the 
shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} that counts how many 
times {{TransportChannelHandler#exceptionCaught}} has been called. I think this 
gives us the insight into request failure rates that we're currently missing, 
but I'm open to alternatives if people have other ideas.
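
To make the idea concrete, here is a minimal, self-contained sketch of counting caught exceptions. It is not the code from the PR: the class and method names below ({{ShuffleMetricsSketch}}, {{CountingChannelHandlerSketch}}, {{recordCaughtException}}, {{caughtExceptionCount}}) are hypothetical stand-ins, and a plain {{java.util.concurrent.atomic.LongAdder}} is assumed in place of whatever metrics library type the real {{ShuffleMetrics}} would register.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the metrics holder; the real
// ExternalShuffleBlockHandler.ShuffleMetrics is not reproduced here.
final class ShuffleMetricsSketch {
    // LongAdder is thread-safe, which matters because exceptions may be
    // caught concurrently on several channel threads.
    private final LongAdder caughtExceptions = new LongAdder();

    void recordCaughtException() {
        caughtExceptions.increment();
    }

    long caughtExceptionCount() {
        return caughtExceptions.sum();
    }
}

// Hypothetical stand-in for a channel handler. It only mirrors the shape
// of an exceptionCaught callback; it does not depend on Netty or Spark.
final class CountingChannelHandlerSketch {
    private final ShuffleMetricsSketch metrics;

    CountingChannelHandlerSketch(ShuffleMetricsSketch metrics) {
        this.metrics = metrics;
    }

    void exceptionCaught(Throwable cause) {
        // Count the failure before doing anything else, so the metric is
        // updated even if later handling throws.
        metrics.recordCaughtException();
        // A real handler would log the cause and close the channel here.
    }
}

final class ExceptionMetricDemo {
    public static void main(String[] args) {
        ShuffleMetricsSketch metrics = new ShuffleMetricsSketch();
        CountingChannelHandlerSketch handler =
            new CountingChannelHandlerSketch(metrics);

        // Simulate two failed requests.
        handler.exceptionCaught(new java.io.IOException("connection reset"));
        handler.exceptionCaught(new RuntimeException("decode failure"));

        System.out.println(metrics.caughtExceptionCount()); // prints 2
    }
}
```

A monotonically increasing counter like this would let a monitoring system derive a failure rate from the difference between successive readings, which is the kind of alerting signal described above.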




