Hean-Chhinling commented on code in PR #8222:
URL: https://github.com/apache/hadoop/pull/8222#discussion_r2747564944


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the
+    // getZkData method returns a null value causing the RM to fail.
+    // To prevent this, added a re-try mechanism before giving up.
+    int maxRetries = 6;

Review Comment:
   What do you think of making the Max retries configurable like the 
sleepBetweenRetries?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*

Review Comment:
   I think we could delete the old getZkData method here instead of commented 
it out



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the
+    // getZkData method returns a null value causing the RM to fail.
+    // To prevent this, added a re-try mechanism before giving up.
+    int maxRetries = 6;
+    int attempt = 1;
+    int sleepBetweenRetries = conf.getInt(
+        YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS,
+        YarnConfiguration.DEFAULT_RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS);
+
+    while (attempt < maxRetries) {
+
+        if(zkManager.exists(path)) {
+          LOG.debug("zkManager.exists(path) {} exists.", path);
+          byte[] zkData = zkManager.getData(path);
+          // If we found data, return immediately

Review Comment:
   Maybe this comment is not needed and a little misleading.
   Because we also check if the data is not null and not empty?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java:
##########
@@ -1007,6 +1007,16 @@ public static boolean isAclEnabled(Configuration conf) {
   public static final String DEFAULT_RM_SCHEDCONF_STORE_ZK_PARENT_PATH =
       "/confstore";
 
+  /** Retry period for  ZKConfigurationStore will create znodes. */
+  @Private
+  @Unstable

Review Comment:
   I am just curious here.
   What does the "unstable" annotation means here and why do we it?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java:
##########
@@ -198,7 +198,27 @@ public void testDisableAuditLogs() throws Exception {
     prepareLogMutation("key1", "val1");
 
     data = ((ZKConfigurationStore) confStore).getZkData(logsPath);
-    assertNull(data, "Failed to Disable Audit Logs");
+    assertEquals("Failed to Disable Audit Logs", 0, data.length);
+  }
+
+  @Test
+  public void testZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);
+    String confStorePath = getZkPath("CONF_STORE");
+    byte[] data = ((ZKConfigurationStore) confStore).getZkData(confStorePath);
+    assertTrue("ConfStore data expected to be not null.", data.length > 0);
+  }
+
+  @Test
+  public void testEmptyZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);

Review Comment:
   What do you think of adding some data here before calling format? 
   So that we make sure the data is empty?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java:
##########
@@ -198,7 +198,27 @@ public void testDisableAuditLogs() throws Exception {
     prepareLogMutation("key1", "val1");
 
     data = ((ZKConfigurationStore) confStore).getZkData(logsPath);
-    assertNull(data, "Failed to Disable Audit Logs");
+    assertEquals("Failed to Disable Audit Logs", 0, data.length);
+  }
+
+  @Test
+  public void testZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);
+    String confStorePath = getZkPath("CONF_STORE");
+    byte[] data = ((ZKConfigurationStore) confStore).getZkData(confStorePath);
+    assertTrue("ConfStore data expected to be not null.", data.length > 0);

Review Comment:
   I think it would be nice to add the configuration data here.
   Because it seems like we are checking against empty configuration 
[here](https://github.com/apache/hadoop/blob/90150aff3f84183cebaf21e977834b8a77e8ede7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ConfigurationStoreBaseTest.java#L50)
 
   
   E.g. setting data 
([code](https://github.com/apache/hadoop/blob/90150aff3f84183cebaf21e977834b8a77e8ede7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java#L138))



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the

Review Comment:
   The issue also occur when the /confstore/CONF_STORE path exists but the data 
is not fully written this also cause another RM to fail without the 
root.default queue. Thus retry and make sure the data is not empty.
   I think we could add the followings:
   
   "when the /confstore/CONF_STORE path does exists or when the data is not 
fully written"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to