Re: [PR] YARN-11709. NodeManager should be shut down or blacklisted when it ca… [hadoop]

via GitHub Wed, 07 Aug 2024 08:05:44 -0700


ferdelyi commented on code in PR #6960:
URL: https://github.com/apache/hadoop/pull/6960#discussion_r1707211380



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java:
##########
@@ -451,8 +451,10 @@ public void startLocalizer(LocalizerStartContext ctx)
 
     } catch (PrivilegedOperationException e) {
       int exitCode = e.getExitCode();
-      LOG.warn("Exit code from container {} startLocalizer is : {}",
-          locId, exitCode, e);
+      LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent "
+          + "further containers to get scheduled on the node and cause application failures. " +
+          "Exit code from the container " + locId + "startLocalizer is : " + exitCode, e);
+      nmContext.getNodeStatusUpdater().reportException(e);

Review Comment:
   @zeekling thank you for looking into this change! Yes, when the NM hits an
unrecoverable issue, the root cause needs to be fixed and the NM restarted
manually. This way the RM will not schedule applications on the node while the
issue is present. If we let the RM place containers on the faulty NM, it can
lead to application failures, e.g. an application can reach the maximum number
of application attempts when its AM is scheduled on the same node twice.
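
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not Hadoop code: the `Node` class, the first-fit `pickNode` scheduler, and the limit of two AM attempts are all assumptions made for the illustration. It only shows why taking a faulty node out of scheduling lets the second attempt land elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

public class FaultyNodeSketch {

    /** Hypothetical stand-in for a NodeManager as the scheduler sees it. */
    static final class Node {
        final String name;
        final boolean faulty;    // containers cannot start on this node
        boolean healthy = true;  // health state reported to the scheduler

        Node(String name, boolean faulty) {
            this.name = name;
            this.faulty = faulty;
        }
    }

    // Assumed limit on AM attempts for this illustration.
    static final int MAX_AM_ATTEMPTS = 2;

    /** Simplified first-fit scheduler: returns the first node reporting healthy. */
    static Node pickNode(List<Node> nodes) {
        for (Node n : nodes) {
            if (n.healthy) {
                return n;
            }
        }
        return null;
    }

    /**
     * Runs up to MAX_AM_ATTEMPTS attempts. Returns true if one attempt lands
     * on a working node, false if every attempt fails.
     */
    static boolean runApplication(List<Node> nodes, boolean markUnhealthyOnFailure) {
        for (int attempt = 1; attempt <= MAX_AM_ATTEMPTS; attempt++) {
            Node node = pickNode(nodes);
            if (node == null) {
                return false; // no schedulable node left
            }
            if (!node.faulty) {
                return true;  // the AM started successfully
            }
            if (markUnhealthyOnFailure) {
                // Analogous in spirit to reporting the exception so that the
                // node is treated as unhealthy and skipped by the scheduler.
                node.healthy = false;
            }
        }
        return false; // attempts exhausted, the application fails
    }

    /** Two-node cluster where the first node is the faulty one. */
    static List<Node> cluster() {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("nm1", true));
        nodes.add(new Node("nm2", false));
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println("without marking: " + runApplication(cluster(), false));
        System.out.println("with marking: " + runApplication(cluster(), true));
    }
}
```

In this toy model the application fails when the faulty node stays schedulable, because both attempts land on `nm1`, and succeeds when the node is marked unhealthy after the first failure. A real scheduler does not place containers first-fit, so the sketch overstates how deterministic the repeat placement is.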



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


