siddhantsangwan commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r970994760
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
}).count();
LOG.warn("{} Container moves are canceled.", timeoutCounts);
metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+ metrics.incrementNumContainerMovesFailed(timeoutCounts);
} catch (ExecutionException e) {
LOG.error("Got exception while checkIterationMoveResults", e);
+
metrics.incrementNumContainerMovesFailed(moveSelectionToFutureMap.size());
Review Comment:
Does getting an ExecutionException mean that all container moves failed? We
probably still need to check individual move results like we're doing above for
`TimeoutException`.
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,6 +529,7 @@ private void checkIterationMoveResults() {
allFuturesResult.get(config.getMoveTimeout().toMillis(),
TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
+ metrics.incrementNumContainerMovesFailed(1);
Review Comment:
Not sure I understand this change - why are we increasing move failures by
one if the current thread is interrupted?
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
}).count();
LOG.warn("{} Container moves are canceled.", timeoutCounts);
metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+ metrics.incrementNumContainerMovesFailed(timeoutCounts);
Review Comment:
If we're also considering timeouts as move failures, we need to mention this
in the metrics description. We'd also need to update failure metrics in other
places where timeouts are being calculated, such as
`ContainerBalancerMetrics#incrementCurrentIterationContainerMoveMetric`.
Maybe it's better to keep timeout and failures separate?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]