tgravescs commented on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler
URL: https://github.com/apache/spark/pull/24645#issuecomment-496593324

It's kind of weird that your client (executor) says it can't connect while your NodeManager says it can't send. I'm assuming the "can't send" is to another executor (not the one with the connection error). Do you sometimes see these errors after the job is complete, or while it's actively running?

It sounds like the host may be overloaded. I would check the node's health (disk usage, network usage, etc.) and see whether the NM was GC'ing or whether all of its threads were busy processing chunked fetch requests. Note that https://issues.apache.org/jira/browse/SPARK-24355 added separate threads for handling non-chunked fetch requests.

In general I'm fine with adding more metrics if they are useful; I'm just not sure how actionable this one is. If the shuffle service can't send to a remote host, it isn't necessarily that particular NodeManager/shuffle service that is bad: it could be an application behaving badly, or GC'ing on the executor side and things timing out. Was your plan just to report the metric and then watch for these at the cluster level?
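For concreteness, here is a minimal, hypothetical sketch of what such a metric could look like, assuming a Dropwizard-style `MetricSet` like the ones the shuffle service already exposes; this is not the PR's actual implementation, and the class and method names (`ExceptionCountingMetrics`, `onException`, the `caughtExceptions` counter name) are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;

import com.codahale.metrics.Counter;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricSet;

// Hypothetical sketch: a MetricSet that counts exceptions the shuffle
// service's RPC handler catches. Names are illustrative, not from the PR.
public class ExceptionCountingMetrics implements MetricSet {
  // Monotonically increasing count of exceptions seen by the handler.
  private final Counter caughtExceptions = new Counter();

  // Would be called from the handler's exceptionCaught() path.
  public void onException(Throwable t) {
    caughtExceptions.inc();
  }

  @Override
  public Map<String, Metric> getMetrics() {
    return Collections.singletonMap("caughtExceptions", (Metric) caughtExceptions);
  }
}
```

If the plan is cluster-level monitoring, the NodeManager's metrics system could pull this counter alongside the existing shuffle metrics, so an operator can aggregate it per host and flag nodes whose exception counts spike; though, as noted above, a spike may point at a misbehaving application rather than at that particular shuffle service.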
