desaikomal commented on code in PR #2639:
URL: https://github.com/apache/helix/pull/2639#discussion_r1344730248
##########
helix-core/src/main/java/org/apache/helix/controller/rebalancer/waged/constraints/ConstraintBasedAlgorithm.java:
##########
@@ -120,24 +120,24 @@ public OptimalAssignment calculate(ClusterModel clusterModel) throws HelixRebala
   private Optional<AssignableNode> getNodeWithHighestPoints(AssignableReplica replica,
       List<AssignableNode> assignableNodes, ClusterContext clusterContext,
       Set<String> busyInstances, OptimalAssignment optimalAssignment) {
-    Map<AssignableNode, List<HardConstraint>> hardConstraintFailures = new ConcurrentHashMap<>();
+    Map<AssignableNode, List<String>> hardConstraintFailures = new ConcurrentHashMap<>();
     List<AssignableNode> candidateNodes = assignableNodes.parallelStream().filter(candidateNode -> {
       boolean isValid = true;
       // need to record all the failure reasons and it gives us the ability to debug/fix the runtime
       // cluster environment
       for (HardConstraint hardConstraint : _hardConstraints) {
-        if (!hardConstraint.isAssignmentValid(candidateNode, replica, clusterContext)) {
+        ValidationResult validationResult = hardConstraint.isAssignmentValid(candidateNode, replica, clusterContext);
Review Comment:
Thanks for confirming. If I had to summarize your change, it is:
- Change the method signature to return a ValidationResult. If the check
passes, great; if it fails, the result carries the failure reason.
- For each placement, we already report today why we couldn't place a replica
(even if only in generic terms), so how is this going to help the customer?

Can we clarify with our customer whether this is what we will provide?
```
2023/07/30 00:56:51.769 ERROR [PartialRebalanceRunner] [pool-28-thread-1]
[helix] [] Failed to calculate best possible assignment!
org.apache.helix.HelixRebalanceException: Unable to find any available
candidate node for partition HireIdentity_53;
Fail reasons: {HireIdentity-HireIdentity_53-SLAVE=i
{ ltx1-app64342.prod.linkedin.com_11932=[Node has insufficient capacity],
ltx1-app64356.prod.linkedin.com_11932=[Node has insufficient capacity],
ltx1-app49575.prod.linkedin.com_11932=[A fault zone cannot contain more
than 1 replica of same partition, Node has insufficient capacity],
ltx1-app81203.prod.linkedin.com_11932=[A fault zone cannot contain more
than 1 replica of same partition, Node has insufficient capacity],
ltx1-app81873.prod.linkedin.com_11932=[A fault zone cannot contain more
than 1 replica of same partition, Node has insufficient capacity],
ltx1-app64336.prod.linkedin.com_11932=[Node has insufficient capacity],
ltx1-app82563.prod.linkedin.com_11932=[Node has insufficient capacity],
ltx1-app49842.prod.linkedin.com_11932=[A fault zone cannot contain more
than 1 replica of same partition, Node has insufficient capacity],
```
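
To check that I am reading the change correctly, here is a minimal sketch of the
pattern as I understand it: each constraint returns a result object instead of a
boolean, and the caller collects every failure reason per node. Everything in it
is a simplified, hypothetical stand-in (`ValidationResultSketch`, the `Node`
fields, `ok()`/`fail()`, the two toy constraints); the actual `ValidationResult`
and `HardConstraint` in the PR may look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ValidationResultSketch {

  /** Hypothetical stand-in for the PR's ValidationResult; the real class may differ. */
  static final class ValidationResult {
    final boolean valid;
    final String reason; // null when the check passed

    private ValidationResult(boolean valid, String reason) {
      this.valid = valid;
      this.reason = reason;
    }

    static ValidationResult ok() {
      return new ValidationResult(true, null);
    }

    static ValidationResult fail(String reason) {
      return new ValidationResult(false, reason);
    }
  }

  /** Simplified node: only the fields the two toy constraints below need. */
  static final class Node {
    final String name;
    final int freeCapacity;
    final boolean zoneHoldsReplica;

    Node(String name, int freeCapacity, boolean zoneHoldsReplica) {
      this.name = name;
      this.freeCapacity = freeCapacity;
      this.zoneHoldsReplica = zoneHoldsReplica;
    }
  }

  /** Simplified constraint; the real one also takes a replica and a cluster context. */
  interface HardConstraint {
    ValidationResult isAssignmentValid(Node node, int requiredCapacity);
  }

  static final HardConstraint FAULT_ZONE = (node, required) -> node.zoneHoldsReplica
      ? ValidationResult.fail("A fault zone cannot contain more than 1 replica of same partition")
      : ValidationResult.ok();

  static final HardConstraint CAPACITY = (node, required) -> node.freeCapacity >= required
      ? ValidationResult.ok()
      : ValidationResult.fail("Node has insufficient capacity");

  /** Evaluates every constraint on every node and keeps all failure reasons per node. */
  static Map<String, List<String>> collectFailures(List<Node> nodes, int requiredCapacity,
      List<HardConstraint> constraints) {
    Map<String, List<String>> failures = new LinkedHashMap<>();
    for (Node node : nodes) {
      for (HardConstraint constraint : constraints) {
        ValidationResult result = constraint.isAssignmentValid(node, requiredCapacity);
        if (!result.valid) {
          // Do not stop at the first failure; record every reason for debugging.
          failures.computeIfAbsent(node.name, k -> new ArrayList<>()).add(result.reason);
        }
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("node-a", 1, false),
        new Node("node-b", 1, true),
        new Node("node-c", 10, false));
    System.out.println(collectFailures(nodes, 5, Arrays.asList(FAULT_ZONE, CAPACITY)));
  }
}
```

With fixed reason strings like these, the output has the same shape as the log
above, which is why I am asking what additional detail the customer would see.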
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]