yaooqinn edited a comment on pull request #27943:
URL: https://github.com/apache/spark/pull/27943#issuecomment-637657291


   Thanks, @tgravescs,
   The `spark.blacklist.application.fetchFailure.enabled` looks too risky for 
me.
   
   The `spark.reducer.maxBlocksInFlightPerAddress` looks interesting and, going 
by its documentation, very useful, but I think I have to do more testing before 
production use.
   
   
   Increasing `maxRetries` and `retryWait` is what I am going to try first. 
Of the two, I think increasing `maxRetries` is much better: most retries will 
fail fast on the client side without sending any request to the servers, and a 
short `retryWait` gives clients more opportunities to connect to the servers.
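
A minimal sketch of how such a tuning could be passed to `spark-submit`. It assumes that `maxRetries` and `retryWait` above refer to the `spark.shuffle.io.maxRetries` and `spark.shuffle.io.retryWait` settings; the helper name `shuffle_retry_conf` and the values are hypothetical and only illustrate "more retries, short wait".

```python
def shuffle_retry_conf(max_retries, retry_wait_seconds):
    """Build spark-submit --conf arguments for the shuffle fetch retry settings.

    Assumption: the comment's `maxRetries` / `retryWait` map to the
    spark.shuffle.io.* keys below. The values passed in are illustrative.
    """
    settings = {
        "spark.shuffle.io.maxRetries": str(max_retries),
        "spark.shuffle.io.retryWait": f"{retry_wait_seconds}s",
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Prefer raising the retry count while keeping the wait short.
print(" ".join(shuffle_retry_conf(max_retries=10, retry_wait_seconds=5)))
```

The resulting arguments can be appended to an existing `spark-submit` command line; whether these values suit a given cluster would need testing.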
   
   ---
   
   BTW, this is a good improvement. But we should pay more attention to the 
barely noticeable change in the behavior of `maxRetries` and `retryWait`, even 
though they are just going back to what they actually mean.
   
   The reason I urgently patched this into our in-house distribution is that 
the same Spark job I mentioned above was delayed for hours just because a 
single NodeManager crashed during the shuffle read phase.
   
   The 2 stages were aborted and resubmitted, but most of the executors were 
not freed. One reason is that we don't kill the tasks from the aborted stages 
but let them fail by themselves; the other is that these tasks, which were 
doomed to fail, ran into the problem you described: `the issue was that the 
wait time before wasn't really what it was waiting`. Meanwhile, the tasks in 
the new attempt could not get enough resources to run.
   When the hosts for client and server are in the same network segment, 
`NoRouteToHostException` fails the connection attempt immediately, and the 
task eventually fails once `maxRetries` is exhausted. When they are in 
different network segments, each attempt waits for `connectionTimeOut` until 
it gets a `SocketException` for the connection timeout, and only then retries. 
In general, hosts are more likely to be in different network segments in large 
clusters, which is one reason why failing fast here matters.
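
To make the difference between the two cases concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model (one initial attempt plus `maxRetries` retries, with `retryWait` between attempts); the function name and the numbers are illustrative and not measured from the job described above.

```python
def seconds_until_give_up(max_retries, retry_wait, attempt_cost):
    """Estimate how long a doomed fetch keeps a task alive.

    Simplified model (an assumption, not Spark's exact behavior):
    one initial attempt plus `max_retries` retries, each failing after
    `attempt_cost` seconds, with `retry_wait` seconds between attempts.
    """
    attempts = max_retries + 1
    return attempts * attempt_cost + max_retries * retry_wait


# Same network segment: NoRouteToHostException fails each attempt at once.
fast_fail = seconds_until_give_up(max_retries=3, retry_wait=5, attempt_cost=0)

# Different segments: each attempt waits out a connection timeout (120s here).
timed_out = seconds_until_give_up(max_retries=3, retry_wait=5, attempt_cost=120)

print(fast_fail, timed_out)  # 15 495
```

Under these illustrative numbers, a doomed task holds its executor for about 15 seconds in the fast-fail case versus roughly 8 minutes when every attempt has to time out.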
   
   As this is merged into master only, I hope the story I shared will help 
others who are looking for something similar, just like I was.
   
   

