GrantPSpencer commented on code in PR #2886:
URL: https://github.com/apache/helix/pull/2886#discussion_r1728034271


##########
helix-rest/src/main/java/org/apache/helix/rest/clusterMaintenanceService/HealthCheck.java:
##########
@@ -54,21 +55,24 @@ public enum HealthCheck {
   /**
    * Check if instance has error partitions
    */
-  HAS_ERROR_PARTITION,
-  /**
-   * Check if all resources hosted on the instance can still meet the min 
active replica
-   * constraint if this instance is shutdown
-   */
-  MIN_ACTIVE_REPLICA_CHECK_FAILED;

Review Comment:
   Good point. Excluding the MIN_ACTIVE_REPLICA_CHECK_FAILED definitely works. 
We could also just set the MIN_ACTIVE_REPLICA_CHECK_FAILED switch case in 
`getInstanceHealthStatus()` to a no-op? What do you think is the clearest 
approach? 



##########
helix-rest/src/main/java/org/apache/helix/rest/clusterMaintenanceService/MaintenanceManagementService.java:
##########
@@ -472,9 +472,15 @@ private MaintenanceManagementInstanceInfo 
takeFreeSingleInstanceHelper(String cl
   private List<String> batchHelixInstanceStoppableCheck(String clusterId,
       Collection<String> instances, Map<String, StoppableCheck> 
finalStoppableChecks,
       Set<String> toBeStoppedInstances) {
+
+    // Perform all but min_active replicas check in parallel
     Map<String, Future<StoppableCheck>> helixInstanceChecks = 
instances.stream().collect(
         Collectors.toMap(Function.identity(), instance -> POOL.submit(
             () -> performHelixOwnInstanceCheck(clusterId, instance, 
toBeStoppedInstances))));
+
+    // Perform min_active replicas check sequentially
+    addInstanceMinActiveReplicaCheck(helixInstanceChecks, 
toBeStoppedInstances);

Review Comment:
   `perform the MIN_ACTIVE_REPLICA_CHECK_FAILED first`
   I think we need to perform the min active replica check last as this check 
is dependent on the stoppable status of other nodes. Specifically the result of 
checks instance_not_stable, has_error_partition, and has_disabled_partition 
will determine whether a sibling node will be stopped or not. If a node fails 
these checks, then it should be considered as not being stoppable when we 
perform the min_active check on one of its siblings
   
   2. Good catch, thank you. I didn't account for this. Updated my PR to have a 
guard clause and return early if the min_active_replica check is in the 
`_skipStoppableHealthCheckList`  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to