ankuriitg opened a new pull request #24208: [SPARK-27272][CORE] Enable blacklisting of node/executor on fetch failures by default
URL: https://github.com/apache/spark/pull/24208

## What changes were proposed in this pull request?

SPARK-20898 added a configuration to blacklist a node/executor on fetch failures. The config was deemed risky at the time and was disabled by default until more data could be collected. This commit enables the feature by default, as we have seen a couple of instances where it was found to be useful.

The commit also changes the blacklist criteria slightly. If the external shuffle service is not enabled, the executor is blacklisted immediately (on the first fetch failure). This is consistent with the fact that we delete all the shuffle outputs on that executor, so I think it is useful to also blacklist that executor temporarily.

If the external shuffle service is enabled, instead of blacklisting the node immediately, the commit keeps track of all such fetch failures on that node. If the unique and active fetch failures on that node exceed the configured threshold (it re-uses MAX_FAILED_EXEC_PER_NODE, but this can be changed), then the node is also blacklisted. This ensures that persistent issues with a node do not lead to job failures.

## How was this patch tested?

1. Added a unit test case to ensure that blacklisting works as configured
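
For readers who want the proposed criteria in concrete form, the following is a minimal sketch of the decision rule described under "What changes were proposed", written in Python for brevity. It is not the code in this patch. The class name `FetchFailureTracker`, its fields, and the illustrative threshold value are all hypothetical. The sketch assumes "unique" means one count per executor, reads "exceed" as a strict comparison, and leaves out expiry of old failures (the "active" part); the actual patch may differ on each of these points.

```python
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical model of the rule in this PR; not Spark's BlacklistTracker."""

    def __init__(self, shuffle_service_enabled, max_failed_exec_per_node=2):
        self.shuffle_service_enabled = shuffle_service_enabled
        # Stand-in for MAX_FAILED_EXEC_PER_NODE; the value 2 is illustrative.
        self.max_failed_exec_per_node = max_failed_exec_per_node
        self.blacklisted_executors = set()
        self.blacklisted_nodes = set()
        # node -> executor ids that have reported a fetch failure.
        # Assumption: uniqueness is tracked per executor; expiry is omitted.
        self._failed_execs_by_node = defaultdict(set)

    def on_fetch_failure(self, node, executor_id):
        if not self.shuffle_service_enabled:
            # Shuffle outputs on this executor are deleted anyway,
            # so blacklist it on the first fetch failure.
            self.blacklisted_executors.add(executor_id)
            return
        # With the external shuffle service, only record the failure and
        # blacklist the node once the count exceeds the threshold.
        failed = self._failed_execs_by_node[node]
        failed.add(executor_id)
        if len(failed) > self.max_failed_exec_per_node:
            self.blacklisted_nodes.add(node)
```

With the external shuffle service disabled, a single call to `on_fetch_failure` blacklists the executor. With it enabled and a threshold of 2, the node is blacklisted only after a third distinct executor on that node reports a fetch failure; repeated failures from the same executor do not increase the count.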
