[
https://issues.apache.org/jira/browse/HBASE-29652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-29652:
-----------------------------------
Labels: easy-fix pull-request-available (was: easy-fix)
> Chaos testing in ZK mode does not work on hosts with ZNode persistence issue
> ----------------------------------------------------------------------------
>
> Key: HBASE-29652
> URL: https://issues.apache.org/jira/browse/HBASE-29652
> Project: HBase
> Issue Type: Bug
> Components: integration tests
> Affects Versions: 2.6.3, 2.5.12
> Reporter: Jize Ning
> Priority: Minor
> Labels: easy-fix, pull-request-available
>
> Chaos testing in ZK mode involves a client (ChaosZKClient) passing commands
> to agents (ChaosAgent). The agents execute the commands on their hosts to
> kill/restart HBase processes. If the ChaosAgent process on a worker node is
> restarted within a short time, it may fail to register itself, and it will
> then no longer receive any commands from the client.
>
> During chaos testing setup on a worker node, the ChaosAgent tries to
> register an ephemeral ZNode, and it does so only during initialization:
> {code:java}
> /hbase/chaosAgents/<hostname>{code}
> The ChaosZKClient checks that this ZNode exists before passing commands to
> the agent. If the ZNode is deleted, the ChaosZKClient loses track of the
> agent, and the agent no longer receives any commands. The same problem
> occurs when the ChaosAgent process is restarted on the same host: the
> ephemeral ZNode from the first session has not yet timed out, so the agent
> does not recreate it during the second initialization. When that ZNode
> eventually times out, the agent is orphaned without any exception being
> thrown.
>
> There is a very simple fix. We can add a Watcher when creating the ephemeral
> ZNode so that the agent always recreates it when it gets deleted. This
> ensures that the chaos agent stays reachable from the ChaosZKClient
> throughout its lifecycle.
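
The proposed fix can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the actual patch: FakeZk is a
hypothetical in-memory stand-in for ZooKeeper so the example runs on its own,
and Agent, the session ids, and the worker-1 hostname are likewise made up.
With the real client, the corresponding calls would be
ZooKeeper.exists(path, watcher) plus a check for the NodeDeleted event.
Watches there are one-shot, which is why the sketch re-arms the watch on
every registration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EphemeralReRegistration {

  /** Hypothetical in-memory stand-in for ZooKeeper: path -> owning session id. */
  static class FakeZk {
    private final Map<String, Long> owners = new HashMap<>();
    private final Map<String, List<Runnable>> deleteWatches = new HashMap<>();

    /** Creates an ephemeral node; returns false if the path already exists. */
    boolean createEphemeral(String path, long sessionId) {
      return owners.putIfAbsent(path, sessionId) == null;
    }

    boolean exists(String path) {
      return owners.containsKey(path);
    }

    Long ownerOf(String path) {
      return owners.get(path);
    }

    /** Registers a one-shot watch that fires when the path is deleted. */
    void watchDelete(String path, Runnable watcher) {
      deleteWatches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }

    /** Simulates session expiry: removes the session's nodes, then fires watches. */
    void expireSession(long sessionId) {
      List<String> removed = new ArrayList<>();
      owners.forEach((path, owner) -> {
        if (owner == sessionId) {
          removed.add(path);
        }
      });
      for (String path : removed) {
        owners.remove(path);
        // Detach the watch list first so watchers can safely re-register.
        List<Runnable> watchers = deleteWatches.remove(path);
        if (watchers != null) {
          watchers.forEach(Runnable::run);
        }
      }
    }
  }

  /** Hypothetical agent that keeps its ephemeral node registered. */
  static class Agent {
    private final FakeZk zk;
    private final String path;
    private final long sessionId;

    Agent(FakeZk zk, String hostname, long sessionId) {
      this.zk = zk;
      this.path = "/hbase/chaosAgents/" + hostname;
      this.sessionId = sessionId;
    }

    /** Creates the node if absent, then re-arms the one-shot delete watch. */
    void register() {
      zk.createEphemeral(path, sessionId);
      zk.watchDelete(path, this::register);
    }
  }

  public static void main(String[] args) {
    FakeZk zk = new FakeZk();
    String path = "/hbase/chaosAgents/worker-1";
    // The first agent process (session 1) registered and was then killed.
    zk.createEphemeral(path, 1L);
    // The restarted agent (session 2) finds the stale node still present.
    Agent restarted = new Agent(zk, "worker-1", 2L);
    restarted.register();
    // The old session times out; the watch fires and the node is recreated.
    zk.expireSession(1L);
    System.out.println(zk.exists(path) + " owner=" + zk.ownerOf(path));
  }
}
```

In this model the restarted agent does not need to wait for the stale node
to disappear before starting: it sets the watch even when creation fails, and
takes over the path once the old session expires.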
--
This message was sent by Atlassian Jira
(v8.20.10#820010)