tgravescs commented on pull request #27943:
URL: https://github.com/apache/spark/pull/27943#issuecomment-637569238


   See the config spark.blacklist.application.fetchFailure.enabled 
(http://spark.apache.org/docs/latest/configuration.html).
   But you need to be careful with this one as well: if it's just an 
intermittent failure that the node would recover from shortly, you'd end up 
blacklisting a node that's actually fine.
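   For example, turning it on looks roughly like this (just a sketch; the app 
name is made up, and you'd normally enable the blacklist feature itself too):

       import org.apache.spark.sql.SparkSession

       // Blacklist an executor's node for the rest of the application as soon
       // as a fetch failure happens there. Note this is aggressive: a single
       // intermittent failure excludes a node that may be healthy again soon.
       val spark = SparkSession.builder()
         .appName("fetch-failure-blacklist-demo")   // hypothetical name
         .config("spark.blacklist.enabled", "true")
         .config("spark.blacklist.application.fetchFailure.enabled", "true")
         .getOrCreate()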
   
   Ok, yeah, your node managers are probably very busy, and the disks are 
possibly pegged with all the shuffle and other requests.   Some of that can be 
helped with the shuffle configs like spark.reducer.maxBlocksInFlightPerAddress, 
but those aren't guaranteed to solve your problems; they're basically 
throttling heuristics. There's a sketch of those knobs below.
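   Here's roughly what dialing those throttles down looks like (a sketch only; 
the values are assumptions you'd tune for your cluster, not recommendations):

       import org.apache.spark.sql.SparkSession

       // Throttle how hard reducers hit each node manager during shuffle.
       val spark = SparkSession.builder()
         .appName("shuffle-throttling-demo")   // hypothetical name
         // Max shuffle blocks fetched concurrently from any one address
         // (the default is effectively unlimited).
         .config("spark.reducer.maxBlocksInFlightPerAddress", "128")
         // Cap concurrent fetch requests and total bytes in flight.
         .config("spark.reducer.maxReqsInFlight", "16")
         .config("spark.reducer.maxSizeInFlight", "48m")
         .getOrCreate()

   Lowering maxBlocksInFlightPerAddress spreads the load across node managers, 
but if the disks themselves are saturated it only smooths the spikes, it 
doesn't add capacity.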
   
   Yeah, the issue was that the wait time before wasn't really how long it 
waited; it could end up waiting much longer.  I think it makes sense for you to 
increase the retryWait time and give it a try.  On certain jobs we run where it 
isn't necessarily all about speed, and we really want them to finish no matter 
what, we set those pretty high.
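   Something like this is what I mean by setting them pretty high (a sketch; 
the numbers are assumptions for a job where finishing matters more than speed):

       import org.apache.spark.sql.SparkSession

       // Retry shuffle fetches more times, with a longer pause between
       // attempts, so transient node-manager hiccups don't fail the stage.
       val spark = SparkSession.builder()
         .appName("patient-shuffle-retries")   // hypothetical name
         .config("spark.shuffle.io.maxRetries", "10")   // default: 3
         .config("spark.shuffle.io.retryWait", "30s")   // default: 5s
         .getOrCreate()

   Roughly, a fetch keeps being retried for up to maxRetries * retryWait 
before giving up, so those two settings together bound how patient the job is.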