tgravescs commented on issue #24645: [SPARK-27773][Shuffle] add metrics for 
number of exceptions caught in ExternalShuffleBlockHandler
URL: https://github.com/apache/spark/pull/24645#issuecomment-496593324
 
 
   It's kind of weird that your client (executor) is saying it can't connect 
while your NodeManager is saying it can't send. I'm assuming the "can't send" 
is to another executor (not the one with the error about not connecting).
   
   Are these errors you sometimes see after the job is complete, or while it's 
actively running? It sounds like the host is possibly overloaded. I would check 
the node health (disk usage, network usage, etc.) and see whether the NM was 
GC'ing or whether all the threads were busy processing chunked fetch requests. 
Note that https://issues.apache.org/jira/browse/SPARK-24355 added separate 
threads for handling non-chunked fetch requests; a sketch of its tuning knob is 
below.
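   (For reference, a hedged sketch of the knob I believe SPARK-24355 
introduced; I'm assuming the config name from that JIRA, set wherever the 
shuffle server picks up its Spark config, e.g. the YARN shuffle service's 
configuration:)

```properties
# Assumed config from SPARK-24355: the percentage of the shuffle server's
# Netty worker threads reserved for chunk fetch handling, which keeps other
# (non-chunked) RPCs responsive behind large chunk transfers.
spark.shuffle.server.chunkFetchHandlerThreadsPercent=100
```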
   
   I'm generally fine with adding more metrics if they're useful; I'm just not 
sure how actionable this one is. If the shuffle service can't send to a remote 
host, it's not necessarily that particular NodeManager/shuffle service that is 
bad; it could be an application behaving badly or GC'ing on the executor side 
and things timing out. For concreteness, a minimal sketch of the kind of 
counter I understand this PR to be adding is below.
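   (A minimal sketch, assuming the Dropwizard/Codahale Metrics API that the 
shuffle service already uses for its shuffle metrics; the class and metric 
names here are illustrative, not the PR's actual code:)

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricSet;

import java.util.HashMap;
import java.util.Map;

// Illustrative only: a MetricSet exposing a counter that the shuffle
// service's exception handler could bump each time it catches an exception.
public class CaughtExceptionMetricsSketch implements MetricSet {
  private final Counter caughtExceptions = new Counter();

  @Override
  public Map<String, Metric> getMetrics() {
    Map<String, Metric> metrics = new HashMap<>();
    // Hypothetical metric name; the real name is whatever the PR registers.
    metrics.put("numCaughtExceptions", caughtExceptions);
    return metrics;
  }

  // Called from wherever the server catches and logs an exception.
  public void onException(Throwable t) {
    caughtExceptions.inc();
  }
}
```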
   
   Was your plan just to report the metric and then watch for these at the 
cluster level?
