tgravescs commented on pull request #27943:
URL: https://github.com/apache/spark/pull/27943#issuecomment-637716923


   I was running with spark.blacklist.application.fetchFailure.enabled in 
production at Yahoo, but our default timeouts and retries were much higher, and 
it worked well for us. We had also configured spark.reducer.maxBlocksInFlightPerAddress 
and related configs to throttle large jobs so they wouldn't overwhelm the node 
managers. The scenario you mention could benefit from that, because the node 
manager is dead and isn't coming back. You can tie this in with 
spark.blacklist.killBlacklistedExecutors. But I agree with you, I definitely 
suggest testing and rolling it out gradually rather than throwing it straight 
into production.
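   
   For reference, a minimal sketch of how these settings could be combined in a 
SparkConf (Spark 2.x property names; the maxBlocksInFlightPerAddress value below 
is purely illustrative, not our production setting):
   
   ```scala
   import org.apache.spark.SparkConf
   import org.apache.spark.sql.SparkSession
   
   val conf = new SparkConf()
     .set("spark.blacklist.enabled", "true")
     // Blacklist the whole host as soon as a fetch failure points at it,
     // e.g. when its node manager / external shuffle service is gone.
     .set("spark.blacklist.application.fetchFailure.enabled", "true")
     // Also kill executors on blacklisted nodes so replacements get
     // scheduled elsewhere.
     .set("spark.blacklist.killBlacklistedExecutors", "true")
     // Cap how many shuffle blocks a reducer fetches from a single host at
     // once, so large jobs don't overwhelm the node managers
     // (illustrative value; tune for your cluster).
     .set("spark.reducer.maxBlocksInFlightPerAddress", "256")
   
   val spark = SparkSession.builder().config(conf).getOrCreate()
   ```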
