Re: [PR] [apache/helix] -- Fixes #2638, Improve Hard Constraint Failure Debuggability by adding details in the error message [helix]

via GitHub Tue, 03 Oct 2023 13:57:34 -0700


desaikomal commented on code in PR #2639:
URL: https://github.com/apache/helix/pull/2639#discussion_r1344730248



##########
helix-core/src/main/java/org/apache/helix/controller/rebalancer/waged/constraints/ConstraintBasedAlgorithm.java:
##########
@@ -120,24 +120,24 @@ public OptimalAssignment calculate(ClusterModel clusterModel) throws HelixRebala
   private Optional<AssignableNode> getNodeWithHighestPoints(AssignableReplica replica,
       List<AssignableNode> assignableNodes, ClusterContext clusterContext,
       Set<String> busyInstances, OptimalAssignment optimalAssignment) {
-    Map<AssignableNode, List<HardConstraint>> hardConstraintFailures = new ConcurrentHashMap<>();
+    Map<AssignableNode, List<String>> hardConstraintFailures = new ConcurrentHashMap<>();
     List<AssignableNode> candidateNodes = assignableNodes.parallelStream().filter(candidateNode -> {
       boolean isValid = true;
       // need to record all the failure reasons and it gives us the ability to debug/fix the runtime
       // cluster environment
       for (HardConstraint hardConstraint : _hardConstraints) {
-        if (!hardConstraint.isAssignmentValid(candidateNode, replica, clusterContext)) {
+        ValidationResult validationResult  = hardConstraint.isAssignmentValid(candidateNode, replica, clusterContext);

Review Comment:
   Thanks for confirming. To summarize your change as I understand it:
   - The method signature changes to return a ValidationResult. If the check passes, nothing more is needed; if it fails, the result carries the failure reason.
   - For each placement, we already report today why a replica could not be placed (in generic terms, but the reason is still there), so how will this change help the customer?
   
   Can we clarify with our customer if this is what we will provide?
   
   ```
   2023/07/30 00:56:51.769 ERROR [PartialRebalanceRunner] [pool-28-thread-1] [helix] [] Failed to calculate best possible assignment!
   org.apache.helix.HelixRebalanceException: Unable to find any available candidate node for partition HireIdentity_53; 
   Fail reasons: {HireIdentity-HireIdentity_53-SLAVE=i
   {   ltx1-app64342.prod.linkedin.com_11932=[Node has insufficient capacity], 
       ltx1-app64356.prod.linkedin.com_11932=[Node has insufficient capacity], 
       ltx1-app49575.prod.linkedin.com_11932=[A fault zone cannot contain more than 1 replica of same partition, Node has insufficient capacity], 
       ltx1-app81203.prod.linkedin.com_11932=[A fault zone cannot contain more than 1 replica of same partition, Node has insufficient capacity], 
       ltx1-app81873.prod.linkedin.com_11932=[A fault zone cannot contain more than 1 replica of same partition, Node has insufficient capacity], 
       ltx1-app64336.prod.linkedin.com_11932=[Node has insufficient capacity], 
       ltx1-app82563.prod.linkedin.com_11932=[Node has insufficient capacity],
       ltx1-app49842.prod.linkedin.com_11932=[A fault zone cannot contain more than 1 replica of same partition, Node has insufficient capacity], 
   ```
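
To make the question concrete, here is a minimal sketch of the pattern summarized above: a check that returns a result object carrying a specific failure message, collected per node. Everything in it is an assumption for illustration. `ValidationResult` here is a hypothetical stand-in (the PR's real class may have different fields and methods), `SketchConstraint` is a simplified interface that is not Helix's `HardConstraint`, and the node names and capacity numbers are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ValidationResultSketch {

  // Hypothetical stand-in for the PR's ValidationResult; the real class may differ.
  static final class ValidationResult {
    private static final ValidationResult OK = new ValidationResult(true, "");
    private final boolean success;
    private final String errorMessage;

    private ValidationResult(boolean success, String errorMessage) {
      this.success = success;
      this.errorMessage = errorMessage;
    }

    static ValidationResult ok() { return OK; }
    static ValidationResult fail(String message) { return new ValidationResult(false, message); }
    boolean isSuccess() { return success; }
    String getErrorMessage() { return errorMessage; }
  }

  // Simplified constraint for illustration only (not Helix's HardConstraint):
  // a node is reduced to a name plus free capacity, a replica to its weight.
  interface SketchConstraint {
    ValidationResult isAssignmentValid(String node, int freeCapacity, int replicaWeight);
  }

  // A capacity check whose failure message includes the actual numbers,
  // rather than only a generic "insufficient capacity" description.
  static final SketchConstraint CAPACITY = (node, free, weight) ->
      free >= weight
          ? ValidationResult.ok()
          : ValidationResult.fail(String.format(
              "Node has insufficient capacity: needs %d, has %d", weight, free));

  // Collects one list of failure messages per node, mirroring the
  // Map<AssignableNode, List<String>> shape shown in the diff.
  static Map<String, List<String>> collectFailures(Map<String, Integer> nodes,
      int replicaWeight, List<SketchConstraint> constraints) {
    Map<String, List<String>> failures = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> node : nodes.entrySet()) {
      for (SketchConstraint constraint : constraints) {
        ValidationResult result =
            constraint.isAssignmentValid(node.getKey(), node.getValue(), replicaWeight);
        if (!result.isSuccess()) {
          failures.computeIfAbsent(node.getKey(), k -> new ArrayList<>())
              .add(result.getErrorMessage());
        }
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    Map<String, Integer> nodes = new LinkedHashMap<>();
    nodes.put("node-a", 10); // too small for the replica below
    nodes.put("node-b", 50); // large enough
    // prints {node-a=[Node has insufficient capacity: needs 30, has 10]}
    System.out.println(collectFailures(nodes, 30, Collections.singletonList(CAPACITY)));
  }
}
```

Compared with the log above, the per-node map has the same shape; the difference would be that each message can carry the specifics of why that node failed, if the constraint chooses to include them.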



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

