sadanand48 commented on code in PR #6189:
URL: https://github.com/apache/ozone/pull/6189#discussion_r1483084704


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyRequest.java:
##########
@@ -199,18 +199,29 @@ protected List< OmKeyLocationInfo > 
allocateBlock(ScmClient scmClient,
     List<OmKeyLocationInfo> locationInfos = new ArrayList<>(numBlocks);
     String remoteUser = getRemoteUser().getShortUserName();
     List<AllocatedBlock> allocatedBlocks;
-    try {
-      allocatedBlocks = scmClient.getBlockClient()
-          .allocateBlock(scmBlockSize, numBlocks, replicationConfig, serviceID,
-              excludeList, clientMachine);
-    } catch (SCMException ex) {
-      omMetrics.incNumBlockAllocateCallFails();
-      if (ex.getResult()
-          .equals(SCMException.ResultCodes.SAFE_MODE_EXCEPTION)) {
-        throw new OMException(ex.getMessage(),
-            OMException.ResultCodes.SCM_IN_SAFE_MODE);
+    int retryCount = 5;
+    while (true) {
+      try {
+        allocatedBlocks = scmClient.getBlockClient()
+            .allocateBlock(scmBlockSize, numBlocks, replicationConfig, 
serviceID,
+                excludeList, clientMachine);
+      } catch (SCMException ex) {
+        omMetrics.incNumBlockAllocateCallFails();
+        if 
(ex.getResult().equals(SCMException.ResultCodes.SAFE_MODE_EXCEPTION) && 
retryCount > 0) {
+          retryCount--;
+          // SCM is in safe mode, retry again

Review Comment:
   Maybe add a debug log saying " allocate block failed as SCM is in safe mode 
, x retries remaining"



##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyRequest.java:
##########
@@ -199,18 +199,29 @@ protected List< OmKeyLocationInfo > 
allocateBlock(ScmClient scmClient,
     List<OmKeyLocationInfo> locationInfos = new ArrayList<>(numBlocks);
     String remoteUser = getRemoteUser().getShortUserName();
     List<AllocatedBlock> allocatedBlocks;
-    try {
-      allocatedBlocks = scmClient.getBlockClient()
-          .allocateBlock(scmBlockSize, numBlocks, replicationConfig, serviceID,
-              excludeList, clientMachine);
-    } catch (SCMException ex) {
-      omMetrics.incNumBlockAllocateCallFails();
-      if (ex.getResult()
-          .equals(SCMException.ResultCodes.SAFE_MODE_EXCEPTION)) {
-        throw new OMException(ex.getMessage(),
-            OMException.ResultCodes.SCM_IN_SAFE_MODE);
+    int retryCount = 5;

Review Comment:
   Can use a static constant instead



##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyRequest.java:
##########
@@ -199,18 +199,29 @@ protected List< OmKeyLocationInfo > 
allocateBlock(ScmClient scmClient,
     List<OmKeyLocationInfo> locationInfos = new ArrayList<>(numBlocks);
     String remoteUser = getRemoteUser().getShortUserName();
     List<AllocatedBlock> allocatedBlocks;
-    try {
-      allocatedBlocks = scmClient.getBlockClient()
-          .allocateBlock(scmBlockSize, numBlocks, replicationConfig, serviceID,
-              excludeList, clientMachine);
-    } catch (SCMException ex) {
-      omMetrics.incNumBlockAllocateCallFails();
-      if (ex.getResult()
-          .equals(SCMException.ResultCodes.SAFE_MODE_EXCEPTION)) {
-        throw new OMException(ex.getMessage(),
-            OMException.ResultCodes.SCM_IN_SAFE_MODE);
+    int retryCount = 5;
+    while (true) {
+      try {
+        allocatedBlocks = scmClient.getBlockClient()
+            .allocateBlock(scmBlockSize, numBlocks, replicationConfig, 
serviceID,
+                excludeList, clientMachine);
+      } catch (SCMException ex) {
+        omMetrics.incNumBlockAllocateCallFails();
+        if 
(ex.getResult().equals(SCMException.ResultCodes.SAFE_MODE_EXCEPTION) && 
retryCount > 0) {
+          retryCount--;
+          // SCM is in safe mode, retry again
+          try {
+            Thread.sleep(3000);

Review Comment:
   can use static constant for 3000 also 



##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestScmSafeMode.java:
##########
@@ -334,4 +343,46 @@ public void testSCMSafeModeDisabled() throws Exception {
     cluster.restartStorageContainerManager(true);
     assertFalse(scm.isInSafeMode());
   }
+
+  @Test
+  public void testCreateRetryWhileSCMSafeMode() throws Exception {
+    // Test1: Test safe mode  when there are no containers in system.
+    cluster.stop();
+
+    try {
+      cluster = builder.build();
+    } catch (IOException e) {
+      fail("Cluster startup failed.");
+    }
+
+    final String rootPath = String.format("%s://%s/",
+        OZONE_OFS_URI_SCHEME, conf.get(OZONE_OM_ADDRESS_KEY));
+    conf.set(CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY, rootPath);
+
+    try (FileSystem fs = FileSystem.get(conf)) {
+      assertTrue(((SafeMode)fs).setSafeMode(SafeModeAction.GET));
+
+      Thread t = new Thread(() -> {
+        try {
+          LOG.info("Sleep 10 seconds and then start DataNodes.");
+          Thread.sleep(10 * 1000);
+
+          cluster.startHddsDatanodes();
+          cluster.waitForClusterToBeReady();
+          cluster.waitTobeOutOfSafeMode();
+        } catch (InterruptedException | TimeoutException e) {
+          throw new RuntimeException(e);
+        }
+      });
+      t.start();
+
+      final Path file = new Path("file");
+      try (FSDataOutputStream outputStream = fs.create(file, true)) {
+        LOG.info("Successfully created a file");

Review Comment:
   This can be flaky at times, can we instead verify the  omMetric : 
NumBlockAllocateCallFails is 5 when SCM is in safemode. I'm also okay if the 
current test passes flaky check if executed multiple times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to