tgravescs commented on pull request #27943:
URL: https://github.com/apache/spark/pull/27943#issuecomment-636930244


   Do you see the nodes being blacklisted, or are there not enough failures to 
cause blacklisting?  In general, though, it sounds like it's GC or some other 
slowness, and the node manager comes back fairly quickly?  Can I ask what your 
settings are now, and perhaps how long the GC pauses you are seeing last?
   
   >  I have one more question here: will the first task that failed to create 
the client and set lastConnectionFailed hit the fast-fail logic when 
it retries later?
   
   No, it's supposed to fail any attempts that happen after the first task 
failed, up until 95% of the spark.shuffle.io.retryWait timeout has elapsed.  At 
that point it should let any create calls go through to try again. That means 
the first task that failed should be trying to connect again.  Now, I guess 
there is the possibility that if another task's create call came in at that 95% 
retryWait time, that task may try again rather than the first task that failed, 
but some task should try again at that point.
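
   To make the timing concrete, here is a rough sketch of the behaviour 
described above. It is not the code from this PR: `ClientGate`, 
`FastFailError`, and the `connect` callback are hypothetical names for 
illustration, and time is passed in explicitly so the window is easy to reason 
about. Only the 95% window and the idea of a last-failure timestamp come from 
the discussion.

```python
class FastFailError(IOError):
    """Raised when a create call is rejected without attempting a connection."""


class ClientGate:
    """Hypothetical stand-in for the per-address state discussed above."""

    def __init__(self, retry_wait_ms, window_fraction=0.95):
        # Create calls fail fast for this long after a connection failure.
        self.fast_fail_window_ms = retry_wait_ms * window_fraction
        # Timestamp of the most recent failed connection attempt, if any.
        self.last_connection_failed_ms = None

    def create_client(self, now_ms, connect):
        last = self.last_connection_failed_ms
        if last is not None and now_ms - last < self.fast_fail_window_ms:
            # Still inside the window: reject without trying to connect.
            raise FastFailError("recent connection failure; failing fast")
        try:
            client = connect()
        except IOError:
            # A real attempt failed: (re)start the fast-fail window.
            self.last_connection_failed_ms = now_ms
            raise
        # Success clears the failure marker.
        self.last_connection_failed_ms = None
        return client
```

   With a retry wait of 5000 ms, a failed attempt at t=0 causes a create call 
at t=1000 to fail fast, while a call at t=4800 is allowed to attempt a real 
connection again. Whichever task's call arrives first after the window closes 
is the one that reconnects, which matches the caveat above that it may not be 
the task that failed originally.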




