jacob-netguardians opened a new issue, #2965:
URL: https://github.com/apache/helix/issues/2965
### Describe the bug
When two nodes are registered very closely to eachother, it can happen that
the ZKHelixAdmin.addInstance() method is called at about the same time on the
two nodes. The one that comes slightly before the other one will start creating
ZNodes for the "INSTANCE" hierarchy in Zookeeper.
Let's say the second node starts the run of the ZKHelixAdmin.addInstance()
method now. Its call to findInstancesWithMatchingLogicalId() will fail badly,
throwing an exception (see below), because the first node's INSTANCE hierarchy
is not complete:
```
[o.a.h.m.z.ZKHelixAdmin#202] Add instance node-2 to cluster ecwbq5PO.
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path:
/ecwbq5PO/INSTANCES/node-1/MESSAGES
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path:
/ecwbq5PO/INSTANCES/node-1/CURRENTSTATES
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path:
/ecwbq5PO/INSTANCES/node-1/STATUSUPDATES
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path:
/ecwbq5PO/INSTANCES/node-1/ERRORS
[o.a.h.m.z.ZKHelixAdmin#232] Failed to add instance node-2 to cluster
ecwbq5PO with instance operation ENABLE. Setting INSTANCE_OPERATION to UNKNOWN
instead.
org.apache.helix.HelixException: fail to get config. instance: node-1 is NOT
setup in cluster: ecwbq5PO
at
org.apache.helix.ConfigAccessor.getInstanceConfig(ConfigAccessor.java:908)
~[helix-core-1.4.1.1.jar:1.4.1.1]
at
org.apache.helix.util.InstanceUtil.lambda$findInstancesWithMatchingLogicalId$7(InstanceUtil.java:143)
~[helix-core-1.4.1.1.jar:1.4.1.1]
at
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
~[?:?]
at
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
~[?:?]
at
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
~[?:?]
at
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
~[?:?]
at
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
~[?:?]
at
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
~[?:?]
at
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
~[?:?]
at
org.apache.helix.util.InstanceUtil.findInstancesWithMatchingLogicalId(InstanceUtil.java:148)
~[helix-core-1.4.1.1.jar:1.4.1.1]
at
org.apache.helix.util.InstanceUtil.validateInstanceOperationTransition(InstanceUtil.java:118)
~[helix-core-1.4.1.1.jar:1.4.1.1]
at
org.apache.helix.manager.zk.ZKHelixAdmin.addInstance(ZKHelixAdmin.java:228)
~[helix-core-1.4.1.1.jar:1.4.1.1]
...
```
So... while adding node-2, it complains about node-1 being invalid, and
consequently sets the "instance-operation" on node-2 to UNKNOWN instead of
ENABLE.
Version of helix (1.4.1.1) above is due to the fact that I had already fixed
#2963 and had a local version number change for this. Line numbers in the logs
above may also not exactly match what is known in version 1.4.1, due to the
addition of debug logging I had to do before actually understanding the problem.
### Expected behavior
A node should be registered completely or not at all, not having that kind
of impact on the fellow nodes' registration.
### Additional context
I propose the following fix, in the ZKHelixAdmin class, creating all those
nodes in the node's INSTANCE hierarchy in an atomic fashion:
Replace, in the ZKHelixAdmin class, addInstance() method, this:
```
_zkClient.createPersistent(PropertyPathBuilder.instanceMessage(clusterName,
nodeId), true);
_zkClient.createPersistent(PropertyPathBuilder.instanceCurrentState(clusterName,
nodeId), true);
_zkClient
.createPersistent(PropertyPathBuilder.instanceTaskCurrentState(clusterName,
nodeId), true);
_zkClient.createPersistent(PropertyPathBuilder.instanceCustomizedState(clusterName,
nodeId), true);
_zkClient.createPersistent(PropertyPathBuilder.instanceError(clusterName,
nodeId), true);
_zkClient.createPersistent(PropertyPathBuilder.instanceStatusUpdate(clusterName,
nodeId), true);
_zkClient.createPersistent(PropertyPathBuilder.instanceHistory(clusterName,
nodeId), true);
```
with
```
// if those nodes are not created atomically, then having two nodes
registering almost exactly
// at the same time could be problematic, because one node would not
able to check for already
// existing instance with matching logical ID (see
// InstanceUtil.findInstancesWithMatchingLogicalId call above, where
finding an incomplete
// "INSTANCE" node is a killer)
final List<Op> ops = Arrays.asList(
createPersistentNodeOp(PropertyPathBuilder.instance(clusterName,
nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceMessage(clusterName,
nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceCurrentState(clusterName,
nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceTaskCurrentState(clusterName,
nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceCustomizedState(clusterName,
nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceError(clusterName, nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceStatusUpdate(clusterName,
nodeId)),
createPersistentNodeOp(PropertyPathBuilder.instanceHistory(clusterName, nodeId))
);
_zkClient.multi(ops);
```
where `createPersistentNodeOp` method is defined as:
```
private static Op createPersistentNodeOp(String path) {
return Op.create(path, null, ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]