Bryan Beaudreault created HBASE-28365:
-----------------------------------------
Summary: ChaosMonkey batch suspend/resume actions assume shell
implementation
Key: HBASE-28365
URL: https://issues.apache.org/jira/browse/HBASE-28365
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault
These two actions have code like this:
{code:java}
case SUSPEND:
  server = serversToBeSuspended.remove();
  try {
    suspendRs(server);
  } catch (Shell.ExitCodeException e) {
    LOG.warn("Problem suspending but presume successful; code={}",
      e.getExitCode(), e);
  }
  suspendedServers.add(server);
  break;
{code}
This catches only Shell.ExitCodeException, but operators may have a
ClusterManager implementation that does not use a shell. We should expand
this to catch all exceptions.
As a result, any other exception propagates uncaught, and the server is never
added to suspendedServers. If the suspension actually succeeded, the affected
processes stay suspended until someone intervenes manually.
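
A minimal, self-contained sketch of the proposed change, not the actual HBase
patch. The class BatchSuspendSketch, the Suspender interface, and the
suspendAll method are hypothetical stand-ins for the action and for
suspendRs/ClusterManager. The point is only that catching Exception, rather
than Shell.ExitCodeException alone, lets every server reach suspendedServers
even when a non-shell implementation fails in its own way:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

class BatchSuspendSketch {

  /** Hypothetical stand-in for a suspend call backed by any ClusterManager. */
  interface Suspender {
    void suspend(String server) throws Exception;
  }

  /**
   * Drains the queue, attempting to suspend each server. Every server is
   * recorded as suspended, even if the suspend call threw, so that a later
   * resume pass would still cover it.
   */
  static List<String> suspendAll(Queue<String> serversToBeSuspended, Suspender suspender) {
    List<String> suspendedServers = new ArrayList<>();
    while (!serversToBeSuspended.isEmpty()) {
      String server = serversToBeSuspended.remove();
      try {
        suspender.suspend(server);
      } catch (Exception e) {
        // Broad catch: the failure type depends on the implementation,
        // so do not assume a shell exit code is available.
        System.err.println("Problem suspending " + server
          + " but presume successful: " + e);
      }
      // Reached on both success and failure.
      suspendedServers.add(server);
    }
    return suspendedServers;
  }

  public static void main(String[] args) {
    Queue<String> pending = new ArrayDeque<>(Arrays.asList("rs1", "rs2", "rs3"));
    // Simulated non-shell implementation that fails for one server.
    Suspender flaky = server -> {
      if (server.equals("rs2")) {
        throw new IllegalStateException("non-shell failure");
      }
    };
    System.out.println(suspendAll(pending, flaky)); // prints [rs1, rs2, rs3]
  }
}
```

With the narrower catch, the IllegalStateException for rs2 would escape the
loop and rs2 would not be tracked for resume.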
--
This message was sent by Atlassian Jira
(v8.20.10#820010)