yaooqinn edited a comment on pull request #27943:
URL: https://github.com/apache/spark/pull/27943#issuecomment-637657291
Thanks, @tgravescs,
`spark.blacklist.application.fetchFailure.enabled` looks too risky to me.
`spark.reducer.maxBlocksInFlightPerAddress` looks interesting and, as
documented, very useful, but I think I need to do more testing before using
it in production.
Increasing `maxRetries` and `retryWait` is the way I am going to try first.
Of the two, I think increasing `maxRetries` is much better, as most retries will
fail directly and fast on the client side with no requests to servers, and short
`retryWait`s give clients more opportunities to get connected to servers.
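
To make the trade-off above concrete, here is a minimal sketch in plain Python
(not Spark code). It assumes a simplified model in which a fetch makes one
initial attempt plus `maxRetries` retries, sleeping `retryWait` before each
retry. The function name and the numbers are hypothetical and chosen only for
illustration.

```python
def attempt_start_times(max_retries, retry_wait_s, attempt_cost_s=0.0):
    """Return the start time (in seconds) of the initial attempt and each retry.

    Simplified model: every attempt takes attempt_cost_s to fail, then the
    client sleeps retry_wait_s before trying again.
    """
    times = []
    now = 0.0
    for _ in range(max_retries + 1):  # initial attempt + max_retries retries
        times.append(now)
        now += attempt_cost_s + retry_wait_s
    return times


# Two hypothetical tunings with the same 60 s total backoff budget.
few_long = attempt_start_times(max_retries=3, retry_wait_s=20)
many_short = attempt_start_times(max_retries=12, retry_wait_s=5)

print(len(few_long), few_long[-1])
print(len(many_short), many_short[-1])
```

Under this model both tunings give up after the same amount of backoff, but the
second one probes the server 13 times instead of 4, which is the "more
opportunities to get connected" argument above.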
---
BTW, this is a good improvement. But we should pay more attention to the
subtle behavior change in `maxRetries` and `retryWait`, even though they have
simply gone back to meaning exactly what their names say.
The reason I urgently patched this into our in-house distribution is that
the same Spark job I mentioned above was delayed for hours just because a
single NodeManager crashed during the shuffle read phase.
The 2 stages were aborted and resubmitted, but most of the executors were not
freed. One reason is that we don't kill the tasks from the aborted stages but
let them fail by themselves; the other is that these tasks, which were doomed
to fail, were stuck in the situation where `the issue was that the wait time
before wasn't really what it was waiting`. Meanwhile, the tasks in the new
attempt could not get enough resources to run.
When the client and server hosts are in the same network segment,
`NoRouteToHostException` fails the connection attempt immediately, and the
task eventually fails once `maxRetries` is exhausted. But when they are in
different network segments, each attempt waits for `connectionTimeOut` until it
gets a `SocketException` for the connection timeout, and then retries... In
general, hosts are more likely to be in different network segments in large
clusters, which is one reason why the fast fail here matters.
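
As a rough illustration of the difference between the two cases, the sketch
below estimates how long a doomed task keeps its executor busy before the
retries run out. It is plain Python under the same simplified model (one
initial attempt plus `maxRetries` retries), and the values 3 retries, 5 s wait
and 120 s connection timeout are assumptions picked for the example, not
figures from the job described here.

```python
def time_to_fail_s(max_retries, retry_wait_s, attempt_cost_s):
    """Estimate the seconds until a fetch gives up when every attempt fails.

    attempt_cost_s is how long one connection attempt takes to fail:
    close to 0 for an immediate NoRouteToHostException, or the full
    connection timeout when the attempt has to time out.
    """
    attempts = max_retries + 1  # initial attempt + retries
    return attempts * attempt_cost_s + max_retries * retry_wait_s


# Same network segment: each attempt fails almost immediately.
same_segment = time_to_fail_s(max_retries=3, retry_wait_s=5, attempt_cost_s=0)

# Different segments: each attempt waits for the (assumed) 120 s timeout.
cross_segment = time_to_fail_s(max_retries=3, retry_wait_s=5, attempt_cost_s=120)

print(same_segment, cross_segment)
```

With these assumed values the cross-segment case takes 495 s to give up versus
15 s for the same-segment case, so the doomed tasks hold their executors far
longer when every attempt has to time out.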
As this is merged into master only, I hope the story I shared will help
others who are looking for something like this, just as I was.