lokeshj1703 commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r978362904


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +563,20 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long cancelAndCountPendingMoves() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {}.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);

Review Comment:
   According to the Javadoc for cancel, the call completes the future 
exceptionally. If we call this function on InterruptedException, every 
pending move would be marked as failed in the whenComplete callback. I do not 
think we should count these as move failures, since that could raise an 
unnecessary alarm for the administrator.
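
To illustrate the concern, here is a minimal standalone sketch. It assumes the move futures are `CompletableFuture`s and that the callback is registered on the same future that gets cancelled. `CancelDemo` and `classify` are hypothetical names, not part of ContainerBalancer.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class CancelDemo {
  // Hypothetical helper: maps the throwable seen by whenComplete to an outcome.
  public static String classify(Throwable ex) {
    if (ex == null) {
      return "completed";
    }
    if (ex instanceof CancellationException) {
      // Cancellation is reported separately instead of as a failure.
      return "cancelled";
    }
    return "failed";
  }

  public static void main(String[] args) {
    CompletableFuture<String> move = new CompletableFuture<>();
    AtomicReference<String> outcome = new AtomicReference<>("pending");

    // Callback registered before the future completes, on the future itself.
    move.whenComplete((result, ex) -> outcome.set(classify(ex)));

    // cancel() completes the future exceptionally, so the callback fires.
    move.cancel(true);

    System.out.println("completedExceptionally="
        + move.isCompletedExceptionally());
    System.out.println("outcome=" + outcome.get());
  }
}
```

A callback that only checks `ex != null` would treat the cancelled move as a failure. Note that a stage further down a chain may see the cancellation wrapped in a `CompletionException`, so a real check might also need to inspect the cause.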



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,12 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long cancelCount = cancelAndCountPendingMoves();
+      LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+          cancelCount);
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   Unrelated to this PR: on timeout we have been cancelling the future, 
which means a timed-out move is counted as both a move failure and a move 
timeout. I think it would be better to treat the different metrics as 
disjoint sets (completed, timeout, failed).
   @siddhantsangwan 
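
One way to keep the three sets disjoint is to record only the first outcome per move. This is a rough sketch under that assumption. `DisjointMoveMetrics`, `Outcome`, and `record` are hypothetical and do not reflect the existing balancer metrics class.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DisjointMoveMetrics {
  public enum Outcome { COMPLETED, TIMEOUT, FAILED }

  // First recorded outcome per container ID; later outcomes are ignored.
  private final Map<Long, Outcome> outcomeByContainer =
      new ConcurrentHashMap<>();

  /** Records the outcome only if none exists yet; returns true if stored. */
  public boolean record(long containerId, Outcome outcome) {
    return outcomeByContainer.putIfAbsent(containerId, outcome) == null;
  }

  /** Number of moves whose recorded outcome equals the given one. */
  public long count(Outcome outcome) {
    return outcomeByContainer.values().stream()
        .filter(o -> o == outcome)
        .count();
  }

  public static void main(String[] args) {
    DisjointMoveMetrics metrics = new DisjointMoveMetrics();
    metrics.record(1L, Outcome.COMPLETED);
    metrics.record(2L, Outcome.TIMEOUT);
    // The cancellation that follows a timeout must not also count as FAILED.
    metrics.record(2L, Outcome.FAILED);
    metrics.record(3L, Outcome.FAILED);

    System.out.println("completed=" + metrics.count(Outcome.COMPLETED)
        + " timeout=" + metrics.count(Outcome.TIMEOUT)
        + " failed=" + metrics.count(Outcome.FAILED));
  }
}
```

With this shape, the timeout handler would record TIMEOUT before cancelling the future, so the later exceptional completion has no effect on the failed count.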



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

