ankuriitg opened a new pull request #24208: [SPARK-27272][CORE] Enable 
blacklisting of node/executor on fetch failures by default
URL: https://github.com/apache/spark/pull/24208
 
 
   ## What changes were proposed in this pull request?
   
   SPARK-20898 added a new configuration to blacklist a node/executor on fetch
   failures. This config was deemed risky at the time and was disabled by default
   until more data was collected.
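
For readers who want to set the flag explicitly rather than rely on the default, here is a minimal sketch. It assumes the configuration added by SPARK-20898 is `spark.blacklist.application.fetchFailure.enabled` and that it only takes effect when `spark.blacklist.enabled` is also set; neither key is named in this description, so please check both against the documentation for your Spark version. The helper `to_submit_args` is hypothetical and only formats `--conf` arguments for `spark-submit`.

```python
from typing import Dict, List

# Assumed configuration keys (not named in the PR description); verify them
# against the configuration docs for the Spark version you run.
BLACKLIST_CONF: Dict[str, str] = {
    "spark.blacklist.enabled": "true",
    "spark.blacklist.application.fetchFailure.enabled": "true",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a conf mapping into `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(BLACKLIST_CONF)))
```

Setting the second value to `"false"` would be the way to opt out once the feature is on by default.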
   
   This commit enables that feature by default, as we have seen a couple of
   instances where it proved useful. Additionally, the commit changes the
   blacklist criteria slightly. If the external shuffle service is not enabled,
   the executor is blacklisted immediately (on the first fetch failure). This is
   consistent with the fact that we delete all the shuffle outputs on that
   executor, so I think it is useful to also blacklist that executor
   temporarily.
   
   If the external shuffle service is enabled, then instead of blacklisting the
   node immediately, the commit keeps track of all such fetch failures on that
   node. If the unique and active fetch failures on that node exceed the
   configured threshold (it re-uses MAX_FAILED_EXEC_PER_NODE, but this can be
   changed), then the node is also blacklisted. This ensures that persistent
   issues with a node do not lead to job failures.
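
The two rules above can be sketched as a small toy model. This is not the code in the patch: the class `FetchFailureTracker` and its field names are hypothetical, "unique" is assumed to mean distinct executors on a node, the default threshold of 2 is an assumption, and expiry of old ("inactive") failures is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FetchFailureTracker:
    """Toy model of the blacklisting rules described in this PR (not Spark code)."""

    shuffle_service_enabled: bool
    # Stand-in for MAX_FAILED_EXEC_PER_NODE; the default of 2 is an assumption.
    max_failed_exec_per_node: int = 2
    blacklisted_executors: Set[str] = field(default_factory=set)
    blacklisted_nodes: Set[str] = field(default_factory=set)
    # node -> distinct executors on that node that have had a fetch failure
    failed_execs_by_node: Dict[str, Set[str]] = field(default_factory=dict)

    def on_fetch_failure(self, node: str, executor: str) -> None:
        if not self.shuffle_service_enabled:
            # Without the external shuffle service the executor's shuffle
            # outputs are dropped anyway, so blacklist it on the first failure.
            self.blacklisted_executors.add(executor)
            return
        # With the shuffle service, only record the failure at first.
        failed = self.failed_execs_by_node.setdefault(node, set())
        failed.add(executor)
        # "Exceed" is read literally as strictly greater than the threshold.
        if len(failed) > self.max_failed_exec_per_node:
            self.blacklisted_nodes.add(node)


if __name__ == "__main__":
    tracker = FetchFailureTracker(shuffle_service_enabled=True)
    for executor_id in ["exec-1", "exec-2", "exec-3"]:
        tracker.on_fetch_failure("node-a", executor_id)
    print(tracker.blacklisted_nodes)
```

Keeping a set of executors per node means repeated failures from the same executor count once, which is one plausible reading of "unique" here.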
   
   ## How was this patch tested?
   
   1. Added a unit test case to ensure that blacklisting works as configured
