tgravescs commented on pull request #27943: URL: https://github.com/apache/spark/pull/27943#issuecomment-637569238
See the config `spark.blacklist.application.fetchFailure.enabled` (http://spark.apache.org/docs/latest/configuration.html). But you need to be careful with this as well if it's just an intermittent type of failure that the node would recover from shortly.

Ok, yeah, your node managers are probably very busy and the disks possibly pegged with all the shuffle and other requests. Some of that can be helped with shuffle configs like `spark.reducer.maxBlocksInFlightPerAddress`, but those aren't guaranteed to solve your problems; they are basically throttling heuristics.

Yeah, the issue was that the wait time before wasn't really how long it was actually waiting; it could end up waiting much longer. I think it makes sense for you to increase the retryWait time and give it a try. I know on certain jobs we run, where it isn't necessarily about speed but we really want them to finish no matter what, we set those pretty high.
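The configs discussed above could be combined roughly as follows. This is an illustrative sketch, not a recommendation from this thread: the specific values (64 blocks, 30s wait, 10 retries) are placeholder assumptions to tune for your cluster, and `spark.shuffle.io.retryWait`/`spark.shuffle.io.maxRetries` are the standard Spark properties behind the "retryWait" setting mentioned above.

```
spark-submit \
  --conf spark.blacklist.application.fetchFailure.enabled=true \
  --conf spark.reducer.maxBlocksInFlightPerAddress=64 \
  --conf spark.shuffle.io.retryWait=30s \
  --conf spark.shuffle.io.maxRetries=10 \
  ... your application ...
```

`fetchFailure.enabled=true` blacklists a node immediately on a fetch failure (risky for transient failures, per the caveat above); `maxBlocksInFlightPerAddress` throttles concurrent fetches per remote host to ease pressure on busy node managers; the larger `retryWait` and `maxRetries` trade job speed for a better chance of finishing.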