[GitHub] [hadoop] goiri commented on a diff in pull request #5905: [YARN-11421] Graceful Decommission ignores launched containers and gets deactivated before timeout

via GitHub Tue, 15 Aug 2023 15:14:04 -0700


goiri commented on code in PR #5905:
URL: https://github.com/apache/hadoop/pull/5905#discussion_r1295183661



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/DecommissioningNodesWatcher.java:
##########
@@ -126,6 +127,11 @@ public void init(Configuration conf) {
         YarnConfiguration.RM_DECOMMISSIONING_NODES_WATCHER_POLL_INTERVAL,
         YarnConfiguration
           .DEFAULT_RM_DECOMMISSIONING_NODES_WATCHER_POLL_INTERVAL);
+    // expire interval should not be configured more than 
RM_AM_EXPIRY_INTERVAL_MS
+    this.expireIntvl = 
Math.min(conf.getLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS,
+            YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS),
+    conf.getInt(YarnConfiguration.RM_DECOMMISSIONING_NODES_WATCHER_DELAY_MS,

Review Comment:
   This code is a little hard to read, maybe extracking?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestDecommissioningNodesWatcher.java:
##########
@@ -162,6 +162,80 @@ public void 
testDecommissioningNodesWatcherWithPreviousRunningApps()
         watcher.checkDecommissioningStatus(id1));
   }
 
+  @Test
+  public void testDecommissioningNodesWatcherWithScheduledAMContainers()
+          throws Exception {
+    Configuration conf = new Configuration();
+    // decommission timeout is 10 min
+    conf.set(YarnConfiguration.RM_NODE_GRACEFUL_DECOMMISSION_TIMEOUT, "600");
+    // watcher delay is 5min
+    conf.set(YarnConfiguration.RM_DECOMMISSIONING_NODES_WATCHER_DELAY_MS, 
"300000");
+
+    rm = new MockRM(conf);
+    rm.start();
+
+    DecommissioningNodesWatcher watcher =
+            new DecommissioningNodesWatcher(rm.getRMContext());
+    watcher.init(conf);
+
+    MockNM nm1 = rm.registerNode("host1:1234", 10240);
+    RMNodeImpl node1 =
+            (RMNodeImpl) rm.getRMContext().getRMNodes().get(nm1.getNodeId());
+    NodeId id1 = nm1.getNodeId();
+
+    rm.waitForState(id1, NodeState.RUNNING);
+
+    // just submit the app
+    RMApp app = MockRMAppSubmitter.submit(rm,
+            MockRMAppSubmissionData.Builder.createWithMemory(2000, 
rm).build());
+    MockAM am = MockRM.launchAndRegisterAM(app, rm, nm1);
+
+    // rmnode still thinks there are 0 apps because it hasn't received updated 
node status
+    Assert.assertEquals(0, node1.getRunningApps().size());
+
+    // Setup nm1 as DECOMMISSIONING for DecommissioningNodesWatcher. Right now
+    // there is no container running on the node.
+    rm.sendNodeGracefulDecommission(nm1,
+            YarnConfiguration.DEFAULT_RM_NODE_GRACEFUL_DECOMMISSION_TIMEOUT);
+    rm.waitForState(id1, NodeState.DECOMMISSIONING);
+
+    // we should still get WAIT_SCHEDULED_APPS as expiry time is not over
+    NodeHealthStatus status = NodeHealthStatus.newInstance(true, "",
+            System.currentTimeMillis() - 1000);
+    NodeStatus nodeStatus = NodeStatus.newInstance(id1, 0,
+            new ArrayList<>(), Collections.emptyList(), status, null, null, 
null);
+    watcher.update(node1, nodeStatus);
+
+    Assert.assertFalse(watcher.checkReadyToBeDecommissioned(id1));
+    Assert.assertEquals(DecommissioningNodeStatus.WAIT_SCHEDULED_APPS,
+            watcher.checkDecommissioningStatus(id1));
+
+    // update node with 3 running containers
+    nodeStatus = createNodeStatus(id1, app, 3);
+    node1.handle(new RMNodeStatusEvent(nm1.getNodeId(), nodeStatus));
+    watcher.update(node1, nodeStatus);
+    Assert.assertEquals(1, node1.getRunningApps().size());
+    Assert.assertFalse(watcher.checkReadyToBeDecommissioned(id1));
+    Assert.assertEquals(DecommissioningNodeStatus.WAIT_CONTAINER,
+        watcher.checkDecommissioningStatus(id1));
+
+    // update node with 0 running containers
+    nodeStatus = createNodeStatus(id1, app, 0);
+    node1.handle(new RMNodeStatusEvent(nm1.getNodeId(), nodeStatus));
+    watcher.update(node1, nodeStatus);
+    Assert.assertFalse(watcher.checkReadyToBeDecommissioned(id1));
+    // we still get status as WAIT_APP since app is still running even if
+    // containers are 0
+    Assert.assertEquals(DecommissioningNodeStatus.WAIT_APP,
+            watcher.checkDecommissioningStatus(id1));

Review Comment:
   Indentation must be wrong.



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/DecommissioningNodesWatcher.java:
##########
@@ -247,7 +257,15 @@ public DecommissioningNodeStatus 
checkDecommissioningStatus(NodeId nodeId) {
     }
 
     if (context.appIds.size() == 0) {
-      return DecommissioningNodeStatus.READY;
+      // wait for am expire interval or decommission timeout whichever is 
smaller
+      // if decommission timeout is negative, use am expire interval
+      long effectiveTimeout = context.timeoutMs > 0 ? 
Math.min(context.timeoutMs, expireIntvl) :
+          expireIntvl;
+      LOG.debug("checkReadyToBeDecommissioned " + nodeId + ", 
context.timeoutMs=" +

Review Comment:
   Use logger {}.



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/DecommissioningNodesWatcher.java:
##########
@@ -126,6 +127,11 @@ public void init(Configuration conf) {
         YarnConfiguration.RM_DECOMMISSIONING_NODES_WATCHER_POLL_INTERVAL,
         YarnConfiguration
           .DEFAULT_RM_DECOMMISSIONING_NODES_WATCHER_POLL_INTERVAL);
+    // expire interval should not be configured more than 
RM_AM_EXPIRY_INTERVAL_MS
+    this.expireIntvl = 
Math.min(conf.getLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS,

Review Comment:
   Can we use getTimeDuration()?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hadoop] goiri commented on a diff in pull request #5905: [YARN-11421] Graceful Decommission ignores launched containers and gets deactivated before timeout

Reply via email to