Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/19145
Did you enable RM or NM recovery? Can you please clarify?
Normally, if we assume there are 2 containers running on this NM, then after
10 minutes the RM will detect the failure of the NM and relaunch the 2 lost
containers on other NMs, so the total number of executors should still be the
same. But things will be different if NM recovery is enabled, because then the
failure of the NM will not lead to lost containers.
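
For reference, these are the settings I have in mind. This is only a sketch of a yarn-site.xml fragment, not taken from your cluster: the values shown are assumptions for illustration, and the 10 minutes above assumes the liveness expiry interval is left at its default.

```xml
<!-- Illustrative yarn-site.xml fragment; values are assumptions, check your own cluster. -->
<configuration>
  <!-- NM recovery: when true, a restarted NM tries to recover its running containers. -->
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- RM recovery: when true, the RM restores application state after a restart. -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- How long the RM waits without NM heartbeats before treating the NM as lost.
       600000 ms = 10 minutes, which is the default. -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
</configuration>
```

Knowing which of these are set on your cluster would tell us which of the two cases above applies.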