[ 
https://issues.apache.org/jira/browse/HBASE-29652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-29652:
-----------------------------------
    Labels: easy-fix pull-request-available  (was: easy-fix)

> Chaos testing in ZK mode does not work on hosts with ZNode persistence issue
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-29652
>                 URL: https://issues.apache.org/jira/browse/HBASE-29652
>             Project: HBase
>          Issue Type: Bug
>          Components: integration tests
>    Affects Versions: 2.6.3, 2.5.12
>            Reporter: Jize Ning
>            Priority: Minor
>              Labels: easy-fix, pull-request-available
>
> Chaos testing in ZK mode involves a client (ChaosZKClient) passing commands
> to agents (ChaosAgent). The agents execute the commands on their hosts to
> kill or restart HBase processes. If the ChaosAgent process on a worker node
> is restarted within a short time, it may fail to register itself, and it
> will then no longer receive any commands from the client.
>  
> During chaos testing setup on a worker node, the ChaosAgent tries to
> register an ephemeral ZNode, and it does so only during initialization:
> {code:java}
> /hbase/chaosAgents/<hostname>{code}
> The ChaosZKClient checks that this ZNode exists before passing commands to
> the agent. If the ZNode is deleted, the ChaosZKClient loses track of the
> agent, and the agent no longer receives any commands. This can also happen
> when the ChaosAgent process is restarted on the same host: the ephemeral
> ZNode from the first session has not yet timed out, so the agent does not
> recreate it during the second initialization. When that ZNode eventually
> times out, the agent becomes an orphan without throwing any exception.
>  
> The fix is simple: set a Watcher on the ephemeral ZNode when it is created,
> and recreate the ZNode whenever it is deleted. This keeps the ChaosAgent
> reachable from the ChaosZKClient throughout its lifecycle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
