jacob-netguardians opened a new issue, #2965:
URL: https://github.com/apache/helix/issues/2965

   ### Describe the bug
   
   When two nodes are registered very closely to eachother, it can happen that 
the ZKHelixAdmin.addInstance() method is called at about the same time on the 
two nodes. The one that comes slightly before the other one will start creating 
ZNodes for the "INSTANCE" hierarchy in Zookeeper.
   
   Let's say the second node starts the run of the ZKHelixAdmin.addInstance() 
method now. Its call to findInstancesWithMatchingLogicalId() will fail badly, 
throwing an exception (see below), because the first node's INSTANCE hierarchy 
is not complete:
   
   ```
   [o.a.h.m.z.ZKHelixAdmin#202] Add instance node-2 to cluster ecwbq5PO.
   [o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: 
/ecwbq5PO/INSTANCES/node-1/MESSAGES
   [o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: 
/ecwbq5PO/INSTANCES/node-1/CURRENTSTATES
   [o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: 
/ecwbq5PO/INSTANCES/node-1/STATUSUPDATES
   [o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: 
/ecwbq5PO/INSTANCES/node-1/ERRORS
   [o.a.h.m.z.ZKHelixAdmin#232] Failed to add instance node-2 to cluster 
ecwbq5PO with instance operation ENABLE. Setting INSTANCE_OPERATION to UNKNOWN 
instead.
   org.apache.helix.HelixException: fail to get config. instance: node-1 is NOT 
setup in cluster: ecwbq5PO
        at 
org.apache.helix.ConfigAccessor.getInstanceConfig(ConfigAccessor.java:908) 
~[helix-core-1.4.1.1.jar:1.4.1.1]
        at 
org.apache.helix.util.InstanceUtil.lambda$findInstancesWithMatchingLogicalId$7(InstanceUtil.java:143)
 ~[helix-core-1.4.1.1.jar:1.4.1.1]
        at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
 ~[?:?]
        at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
 ~[?:?]
        at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) 
~[?:?]
        at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
 ~[?:?]
        at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
 ~[?:?]
        at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
~[?:?]
        at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
 ~[?:?]
        at 
org.apache.helix.util.InstanceUtil.findInstancesWithMatchingLogicalId(InstanceUtil.java:148)
 ~[helix-core-1.4.1.1.jar:1.4.1.1]
        at 
org.apache.helix.util.InstanceUtil.validateInstanceOperationTransition(InstanceUtil.java:118)
 ~[helix-core-1.4.1.1.jar:1.4.1.1]
        at 
org.apache.helix.manager.zk.ZKHelixAdmin.addInstance(ZKHelixAdmin.java:228) 
~[helix-core-1.4.1.1.jar:1.4.1.1]
   ...
   ```
   
   So... while adding node-2, it complains about node-1 being invalid, and 
consequently sets the "instance-operation" on node-2 to UNKNOWN instead of 
ENABLE.
   
   Version of helix (1.4.1.1) above is due to the fact that I had already fixed 
#2963 and had a local version number change for this. Line numbers in the logs 
above may also not exactly match what is known in version 1.4.1, due to the 
addition of debug logging I had to do before actually understanding the problem.
   
   ### Expected behavior
   
   A node should be registered completely or not at all, not having that kind 
of impact on the fellow nodes' registration.
   
   ### Additional context
   
   I propose the following fix, in the ZKHelixAdmin class, creating all those 
nodes in the node's INSTANCE hierarchy in an atomic fashion:
   
   Replace, in the ZKHelixAdmin class, addInstance() method, this:
   
   ```
       
_zkClient.createPersistent(PropertyPathBuilder.instanceMessage(clusterName, 
nodeId), true);
       
_zkClient.createPersistent(PropertyPathBuilder.instanceCurrentState(clusterName,
 nodeId), true);
       _zkClient
           
.createPersistent(PropertyPathBuilder.instanceTaskCurrentState(clusterName, 
nodeId), true);
       
_zkClient.createPersistent(PropertyPathBuilder.instanceCustomizedState(clusterName,
 nodeId), true);
       
_zkClient.createPersistent(PropertyPathBuilder.instanceError(clusterName, 
nodeId), true);
       
_zkClient.createPersistent(PropertyPathBuilder.instanceStatusUpdate(clusterName,
 nodeId), true);
       
_zkClient.createPersistent(PropertyPathBuilder.instanceHistory(clusterName, 
nodeId), true);
   ```
   
   with
   
   ```
       // if those nodes are not created atomically, then having two nodes 
registering almost exactly
       // at the same time could be problematic, because one node would not 
able to check for already
       // existing instance with matching logical ID (see
       // InstanceUtil.findInstancesWithMatchingLogicalId call above, where 
finding an incomplete
       // "INSTANCE" node is a killer)
       final List<Op> ops = Arrays.asList(
           createPersistentNodeOp(PropertyPathBuilder.instance(clusterName, 
nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceMessage(clusterName, 
nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceCurrentState(clusterName, 
nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceTaskCurrentState(clusterName,
 nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceCustomizedState(clusterName, 
nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceError(clusterName, nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceStatusUpdate(clusterName, 
nodeId)),
           
createPersistentNodeOp(PropertyPathBuilder.instanceHistory(clusterName, nodeId))
       );
       _zkClient.multi(ops);
   ```
   
   where `createPersistentNodeOp` method is defined as:
   
   ```
     private static Op createPersistentNodeOp(String path) {
       return Op.create(path, null, ZooDefs.Ids.OPEN_ACL_UNSAFE, 
CreateMode.PERSISTENT);
     }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to