lokeshj1703 commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977530477
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,17 @@ private void checkIterationMoveResults() {
allFuturesResult.get(config.getMoveTimeout().toMillis(),
TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
+ long cancelCount = cancelAndCountPendingMoves();
+ metrics.incrementNumContainerMovesFailed(cancelCount);
+ LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+ cancelCount);
Review Comment:
We do not need to handle this since InterruptedException would occur in case
balancer is stopped. We shouldn't mark the moves as failed then. It could be
misleading.
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,17 @@ private void checkIterationMoveResults() {
allFuturesResult.get(config.getMoveTimeout().toMillis(),
TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
+ long cancelCount = cancelAndCountPendingMoves();
+ metrics.incrementNumContainerMovesFailed(cancelCount);
+ LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+ cancelCount);
Thread.currentThread().interrupt();
} catch (TimeoutException e) {
- long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
- .filter(entry -> !entry.getValue().isDone())
- .peek(entry -> {
- LOG.warn("Container move canceled for container {} from source {}"
+
- " to target {} due to timeout.",
- entry.getKey().getContainerID(),
- containerToSourceMap.get(entry.getKey().getContainerID())
- .getUuidString(),
- entry.getKey().getTargetNode().getUuidString());
- entry.getValue().cancel(true);
- }).count();
+ long timeoutCounts = cancelAndCountPendingMoves();
LOG.warn("{} Container moves are canceled.", timeoutCounts);
metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
} catch (ExecutionException e) {
+ cancelAndCountPendingMoves();
Review Comment:
Based on javadoc
ExecutionException – if this future completed exceptionally
This case would be handled in the respective whenComplete operation.
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,15 @@ private void checkIterationMoveResults() {
allFuturesResult.get(config.getMoveTimeout().toMillis(),
TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
+ long failedCount = performMoveCancelAndFailCount();
+ metrics.incrementNumContainerMovesFailed(failedCount);
Review Comment:
I feel this is not a failure scenario since the balancer is being stopped
here.
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
metrics.getNumContainerMovesCompletedInLatestIteration());
}
+ private long performTimeMoveCancelAndCount() {
+ return moveSelectionToFutureMap.entrySet().stream()
+ .filter(entry -> !entry.getValue().isDone())
+ .peek(entry -> {
+ LOG.warn("Container move canceled for container {} from source {}" +
+ " to target {} due to timeout.",
+ entry.getKey().getContainerID(),
+ containerToSourceMap.get(entry.getKey().getContainerID())
+ .getUuidString(),
+ entry.getKey().getTargetNode().getUuidString());
+ entry.getValue().cancel(true);
Review Comment:
Since we are cancelling the future, it can throw exception which would be
handled by the whenComplete leading to metric being updated twice.
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerMetrics.java:
##########
@@ -74,6 +74,10 @@ public final class ContainerBalancerMetrics {
"Container Balancer.")
private MutableCounterLong dataSizeMovedGB;
+ @Metric(about = "Total number container for which moves failed " +
+ "exceptionally across all iterations of Container Balancer.")
+ private MutableCounterLong numContainerMovesFailed;
+
Review Comment:
We have numContainerMovesCompletedInLatestIteration. I think we can have
similar metric numContainerMovesFailedInLatestIteration
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]