cnauroth commented on code in PR #7857:
URL: https://github.com/apache/hadoop/pull/7857#discussion_r2261601292


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java:
##########
@@ -297,6 +297,36 @@ public void 
testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun()
     assertNotNull(discoverer.getGpusUsableByYarn());
   }
 
+  @Test
+  public void testGetGpuDeviceInformationDisableMaxErrors()
+      throws YarnException, IOException {
+    Configuration conf = new Configuration(false);
+    // A negative value should disable max errors enforcement.
+    conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1);
+
+    File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile(
+        this::createFaultyNvidiaSmiScript);
+
+    GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf);
+    assertEquals(fakeBinary.getAbsolutePath(),
+        discoverer.getPathOfGpuBinary());
+    assertNull(discoverer.getEnvironmentToRunCommand().get(PATH));
+
+    final String terminateMsg = "Failed to execute GPU device " +
+        "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times";
+    final String msg = "Failed to execute GPU device detection script";
+
+    // The default max errors is 10. Verify that it keeps going for an 11th 
try.
+    for (int i = 0; i < 11; ++i) {

Review Comment:
   This test is covering the case where you disable the max errors by setting a 
negative value. To make this clearer, I dialed it up to 20 attempts, and I also 
added another test that sets the configuration to 11 and confirms it tries 
exactly 11 times.



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml:
##########
@@ -4650,6 +4650,34 @@
     <value></value>
   </property>
 
+  <property>
+    <description>
+      Sets the maximum duration for executions of the discovery binary defined 
in
+      yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If
+      the binary takes longer than this amount of time to run, then the process
+      is aborted. Discovery may be attempted again, depending on
+      yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
+    </description>
+    <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
+    <value>10000ms</value>

Review Comment:
   Sounds good, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to