ferdelyi commented on code in PR #6960:
URL: https://github.com/apache/hadoop/pull/6960#discussion_r1707211380
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java:
##########
@@ -451,8 +451,10 @@ public void startLocalizer(LocalizerStartContext ctx)
} catch (PrivilegedOperationException e) {
int exitCode = e.getExitCode();
- LOG.warn("Exit code from container {} startLocalizer is : {}",
- locId, exitCode, e);
+ LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent "
+ + "further containers to get scheduled on the node and cause application failures. " +
+ "Exit code from the container " + locId + "startLocalizer is : " + exitCode, e);
+ nmContext.getNodeStatusUpdater().reportException(e);
Review Comment:
@zeekling thank you for looking into this change! Yes, when we hit an
unrecoverable issue with the NM, the root cause needs to be fixed and the NM
manually restarted. Marking the node as unhealthy ensures the RM will not
schedule applications to the node while the issue is present. If we let the RM
place containers on the faulty NM, it can lead to application failures, e.g. an
application reaching the maximum number of application attempts because the AM
was scheduled to the same node twice.
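
To illustrate the failure mode described above, here is a minimal, self-contained
sketch. It is not YARN code: the class `AmAttemptSketch`, the method
`appSucceeds`, the node names, and the "pick the first eligible node" scheduler
are all hypothetical, and the attempt limit of 2 is an assumption made for the
example. It only shows why excluding a faulty node after the first failure lets
the second AM attempt land elsewhere.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; none of these names exist in Hadoop.
class AmAttemptSketch {
  // Assumed limit on AM attempts for this example.
  static final int MAX_AM_ATTEMPTS = 2;

  /**
   * Simulates AM placement with a naive scheduler that always picks the
   * first node not marked unhealthy. An attempt fails if it lands on
   * faultyNode. Returns true if any attempt succeeds within the limit.
   */
  static boolean appSucceeds(List<String> nodes, String faultyNode,
      boolean reportUnhealthy) {
    Set<String> unhealthy = new HashSet<>();
    for (int attempt = 1; attempt <= MAX_AM_ATTEMPTS; attempt++) {
      String chosen = null;
      for (String node : nodes) {
        if (!unhealthy.contains(node)) {
          chosen = node;
          break;
        }
      }
      if (chosen == null) {
        return false; // no schedulable node left
      }
      if (!chosen.equals(faultyNode)) {
        return true; // AM started on a working node
      }
      // The attempt failed on the faulty node. If the node reports itself
      // unhealthy, the scheduler skips it on the next attempt.
      if (reportUnhealthy) {
        unhealthy.add(chosen);
      }
    }
    return false; // attempts exhausted on the faulty node
  }

  public static void main(String[] args) {
    List<String> nodes = List.of("nm1", "nm2");
    System.out.println("without report: " + appSucceeds(nodes, "nm1", false));
    System.out.println("with report:    " + appSucceeds(nodes, "nm1", true));
  }
}
```

In this toy model the application fails when the faulty node stays schedulable
(both attempts land on `nm1`) and succeeds once the node is excluded after the
first failure. The real RM scheduler is far more involved, so treat this only
as an illustration of the argument.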
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]