Bharat Viswanadham created HDDS-5546:
----------------------------------------
Summary: OM Service ID change causes OM startup failure
Key: HDDS-5546
URL: https://issues.apache.org/jira/browse/HDDS-5546
Project: Apache Ozone
Issue Type: Bug
Reporter: Bharat Viswanadham
Assignee: Bharat Viswanadham
In OM HA, the raftGroupID is generated from the OM service ID.
So if the OM service ID changes, OM startup fails with the error below:
{code:java}
2021-08-05 12:20:03,043 ERROR org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
java.io.IOException: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:71)
at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:354)
at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:371)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.start(OzoneManagerRatisServer.java:390)
at org.apache.hadoop.ozone.om.OzoneManager.start(OzoneManager.java:1109)
at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:126)
at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:79)
at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:67)
at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:38)
at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
at picocli.CommandLine.access$1100(CommandLine.java:145)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2152)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:2530)
at picocli.CommandLine.parseWithHandler(CommandLine.java:2465)
at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:51)
Caused by: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:121)
at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:164)
at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:268)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.initialize(OzoneManagerStateMachine.java:127)
at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:193)
at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$4(RaftServerProxy.java:266)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
The reason is that a new Ratis group directory is now created alongside the
existing one, and a single StateMachine instance is shared between the two
groups. The error is confusing to end users because it does not make clear that
the failure was caused by a change in the OM service ID.
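To illustrate why a changed service ID ends up as a different Raft group, here is a minimal sketch. It assumes the group UUID is a name-based UUID computed from the service ID string, and that the short label seen in the log (such as group-8A65FD498CB6) is the last 12 hex digits of that UUID. The class name, method names, and service IDs below are hypothetical and are not taken from the Ozone code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.UUID;

// Hypothetical illustration only; not the actual OzoneManagerRatisServer code.
final class RaftGroupIdSketch {

  // Assumption: the group UUID is a name-based (type 3) UUID of the service ID.
  static UUID groupUuid(String omServiceId) {
    return UUID.nameUUIDFromBytes(omServiceId.getBytes(StandardCharsets.UTF_8));
  }

  // Assumption: the log label is "group-" plus the last 12 hex digits, upper-cased.
  static String shortLabel(UUID uuid) {
    String s = uuid.toString();
    return "group-" + s.substring(s.length() - 12).toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // Hypothetical service IDs before and after a configuration change.
    UUID before = groupUuid("omservice");
    UUID after = groupUuid("omservice-renamed");

    System.out.println("before: " + before + " (" + shortLabel(before) + ")");
    System.out.println("after:  " + after + " (" + shortLabel(after) + ")");
    // The two UUIDs differ, so the renamed service maps to a new group
    // (and, under the assumption above, a new group directory).
    System.out.println("same group: " + before.equals(after));
  }
}
```

Because the derivation is deterministic, the same service ID always maps to the same group, while any change to the string produces an unrelated UUID.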
This Jira adds safeguard code and a clear error message that explains the OM
startup failure. I will raise another Jira to stop using the OM service ID in
the Ratis group ID.
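A safeguard of this kind could look roughly like the sketch below. It is not the patch for this Jira. It assumes that each Raft group is stored in a subdirectory of the Ratis storage directory named after the group UUID, and that the UUID is a name-based UUID of the service ID. The class name, method names, and message wording are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical startup check; names and layout are assumptions, not Ozone APIs.
final class RatisGroupDirCheck {

  // Assumption: the group directory is named after a name-based UUID of the service ID.
  static String expectedDirName(String omServiceId) {
    return UUID.nameUUIDFromBytes(
        omServiceId.getBytes(StandardCharsets.UTF_8)).toString();
  }

  // Fails with a descriptive message if a group directory exists that does not
  // match the currently configured service ID.
  static void verifyGroupDir(Path ratisStorageDir, String omServiceId)
      throws IOException {
    if (!Files.isDirectory(ratisStorageDir)) {
      return; // nothing on disk yet, e.g. a fresh install
    }
    String expected = expectedDirName(omServiceId);
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(ratisStorageDir)) {
      for (Path entry : entries) {
        String name = entry.getFileName().toString();
        if (Files.isDirectory(entry) && !name.equals(expected)) {
          throw new IOException("Ratis group directory " + name
              + " does not match the group ID " + expected
              + " derived from OM service ID '" + omServiceId
              + "'. Was the OM service ID changed?");
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate an existing group directory created under the old service ID.
    Path storage = Files.createTempDirectory("om-ratis-sketch");
    Files.createDirectory(storage.resolve(expectedDirName("omservice")));

    verifyGroupDir(storage, "omservice"); // passes: same service ID
    try {
      verifyGroupDir(storage, "omservice-renamed");
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Running the check before the Ratis server starts would replace the ILLEGAL TRANSITION error with a message that names the service ID as the likely cause.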