tgravescs commented on pull request #27943: URL: https://github.com/apache/spark/pull/27943#issuecomment-636930244
Do you see the nodes being blacklisted, or are there not enough failures to trigger blacklisting? In general, though, it sounds like it's GC or some other slowness, and the node manager comes back fairly quickly. Can I ask what your settings are now, and how long the GC pauses you are seeing last?

> I have one more question here, does the first task that failed to create the client and set the lastConnectionFailed will hit the fast fail logic when it retries later?

No, it's supposed to fail any attempts that happen after the first task failed, up until 95% of the spark.shuffle.io.retryWait timeout has elapsed. At that point it should let any create call go through to try again, which means the first task that failed should be trying to connect again. I guess there is the possibility that if another task's create call comes in right at that 95% retryWait mark, that task may try again rather than the first task that failed, but some task should try again at that point.
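
To make the timing concrete, here is a minimal sketch of the fast-fail window described above. It is not the code from this PR: the class `FastFailGate`, its methods, and the injected `nowMs` clock are hypothetical names chosen for illustration, and the 5000 ms figure in the usage note is only an assumed retryWait value. The one thing taken from the discussion is the rule itself: after a failed connection, create calls are rejected until 95% of the retry wait has elapsed, and then the next call is allowed through.

```java
import java.io.IOException;

// Hypothetical illustration of the fast-fail window; not Spark's actual implementation.
class FastFailGate {
    // Length of the window in which create calls are rejected: 95% of the retry wait.
    private final long fastFailWindowMs;
    // Time of the most recent failed connection attempt; -1 means no failure recorded yet.
    private long lastConnectionFailedMs = -1L;

    FastFailGate(long retryWaitMs) {
        // Integer arithmetic avoids any floating-point rounding in the window length.
        this.fastFailWindowMs = retryWaitMs * 95 / 100;
    }

    // True if a create call arriving at nowMs should be rejected without connecting.
    synchronized boolean shouldFailFast(long nowMs) {
        return lastConnectionFailedMs >= 0
            && nowMs - lastConnectionFailedMs < fastFailWindowMs;
    }

    // Remember when a real connection attempt failed.
    synchronized void recordFailure(long nowMs) {
        lastConnectionFailedMs = nowMs;
    }

    // Stand-in for whatever actually opens the connection (an assumption of this sketch).
    interface Connector {
        void connect() throws IOException;
    }

    // Either fails fast, or makes a real attempt and records the time if it fails.
    void create(long nowMs, Connector connector) throws IOException {
        if (shouldFailFast(nowMs)) {
            throw new IOException("Connection failed recently; failing fast");
        }
        try {
            connector.connect();
        } catch (IOException e) {
            recordFailure(nowMs);
            throw e;
        }
    }
}
```

With an assumed retry wait of 5000 ms the window is 4750 ms, so if the first task fails at t = 1000 ms, every create call before t = 5750 ms is rejected immediately, and whichever call arrives first at or after t = 5750 ms makes a real connection attempt. That call may come from the task that failed originally or from a different task, which is the possibility mentioned above.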
