tgravescs commented on pull request #27943: URL: https://github.com/apache/spark/pull/27943#issuecomment-637716923
I was running with `spark.blacklist.application.fetchFailure.enabled` in production at Yahoo, but our default timeouts and retries were much higher. It worked well for us. We had also configured `spark.reducer.maxBlocksInFlightPerAddress` and related settings to throttle large jobs and keep them from overwhelming the node managers. The scenario you mention could benefit from that, since the node manager is dead and isn't coming back. You can tie this in with `spark.blacklist.killBlacklistedExecutors`. But I agree with you: I definitely suggest testing and rolling this out gradually rather than throwing it straight into production.
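As a minimal sketch, the configs mentioned above could be combined on a `spark-submit` invocation like this. The numeric values and the application jar are illustrative assumptions only, not recommendations; appropriate limits depend on cluster size and shuffle volume:

```shell
# Hypothetical example: enable fetch-failure blacklisting, kill blacklisted
# executors, and throttle per-address shuffle fetches. Values are placeholders.
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.application.fetchFailure.enabled=true \
  --conf spark.blacklist.killBlacklistedExecutors=true \
  --conf spark.reducer.maxBlocksInFlightPerAddress=64 \
  --class com.example.MyApp \
  my-app.jar
```

Lowering `spark.reducer.maxBlocksInFlightPerAddress` caps how many shuffle blocks a reducer requests from a single host at once, which is what keeps large jobs from hammering one node manager.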