codeon opened a new issue #683: Inconsistency in the nodes tracked 
URL: https://github.com/apache/helix/issues/683
 
 
   Hi, 
   
   We are seeing an issue where our nodes are unable to start the participant
   process, and de-registration fails with the following exception:
   
   ```
   2019-10-03 05:31:28 [core-thread-12] ERROR c.uber.streamgate.helix.Participant - Exception while unregistering helix participant
   org.apache.helix.HelixException: Node dca1-prod05_streamgate_shadow_0 does not exist in config for cluster StreamgateClusterV1-DCA1-Shadow
    at org.apache.helix.manager.zk.ZKHelixAdmin.dropInstance(ZKHelixAdmin.java:129)
    at c.u.s.helix.Participant.unregister(Participant.java:134)
    at c.u.s.helix.Participant.registerInstance(Participant.java:109)
    at c.u.s.helix.Participant.run(Participant.java:61)
    at c.u.s.http.endpoints.HelixRegister.lambda$doHandle$0(HelixRegister.java:57)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:309)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
    ```
   
    However, when the node tries to register itself, it gets an exception 
saying that the node is already registered.
   
   From our understanding, this happens under some extreme circumstances
   (like a flappy node restarting quickly): the instance information
   disappears from the `/<HELIX_CLUSTER_NAME>/CONFIGS/PARTICIPANT` path but
   remains stuck under the `/<HELIX_CLUSTER_NAME>/INSTANCES/` path. After
   that, all Helix commands (register, unregister, delete, disable) fail.
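
To make the inconsistency concrete, here is a minimal sketch of how the orphaned entries could be detected. It assumes the child names under `/<HELIX_CLUSTER_NAME>/INSTANCES/` and `/<HELIX_CLUSTER_NAME>/CONFIGS/PARTICIPANT` have already been read from ZooKeeper by some other means. `OrphanedInstanceCheck` and `findOrphans` are hypothetical names and are not part of Helix, and the second node name in `main` is made up for illustration.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of Helix: compares the two znode listings. */
final class OrphanedInstanceCheck {

    /**
     * Returns the names that appear under INSTANCES but have no matching
     * entry under CONFIGS/PARTICIPANT, sorted for stable output.
     */
    static Set<String> findOrphans(Collection<String> instanceChildren,
                                   Collection<String> configChildren) {
        Set<String> orphans = new TreeSet<>(instanceChildren);
        orphans.removeAll(configChildren);
        return orphans;
    }

    public static void main(String[] args) {
        // Assumed listings; in practice these would come from ZooKeeper.
        List<String> instances = List.of(
                "dca1-prod05_streamgate_shadow_0",
                "dca1-prod06_streamgate_shadow_0"); // made-up second node
        List<String> configs = List.of("dca1-prod06_streamgate_shadow_0");

        // Prints [dca1-prod05_streamgate_shadow_0]
        System.out.println(findOrphans(instances, configs));
    }
}
```

Any name this returns would be in the state described above: still present under `INSTANCES` but missing from the participant config.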
   
   In order to fix it, we remove the node from 
`/<HELIX_CLUSTER_NAME>/INSTANCES/` path manually, and restart controller 
processes and the participant nodes so they can register cleanly again.
   
   We would like to understand how such a situation can arise, where the
   instance is cleaned up from one path but remains in the other, leaving the
   two inconsistent.
   
   
   Helix Version : 0.8.2
   

