cnauroth commented on code in PR #7857: URL: https://github.com/apache/hadoop/pull/7857#discussion_r2261601292
########## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java: ########## @@ -297,6 +297,36 @@ public void testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun() assertNotNull(discoverer.getGpusUsableByYarn()); } + @Test + public void testGetGpuDeviceInformationDisableMaxErrors() + throws YarnException, IOException { + Configuration conf = new Configuration(false); + // A negative value should disable max errors enforcement. + conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1); + + File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile( + this::createFaultyNvidiaSmiScript); + + GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf); + assertEquals(fakeBinary.getAbsolutePath(), + discoverer.getPathOfGpuBinary()); + assertNull(discoverer.getEnvironmentToRunCommand().get(PATH)); + + final String terminateMsg = "Failed to execute GPU device " + + "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times"; + final String msg = "Failed to execute GPU device detection script"; + + // The default max errors is 10. Verify that it keeps going for an 11th try. + for (int i = 0; i < 11; ++i) { Review Comment: This test is covering the case where you disable the max errors by setting a negative value. To make this clearer, I dialed it up to 20 attempts, and I also added another test that sets the configuration to 11 and confirms it tries exactly 11 times. ########## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml: ########## @@ -4650,6 +4650,34 @@ <value></value> </property> + <property> + <description> + Sets the maximum duration for executions of the discovery binary defined in + yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If + the binary takes longer than this amount of time to run, then the process + is aborted. Discovery may be attempted again, depending on + yarn.nodemanager.resource-plugins.gpu.discovery-max-errors. + </description> + <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name> + <value>10000ms</value> Review Comment: Sounds good, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org