[ 
https://issues.apache.org/jira/browse/HDDS-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham updated HDDS-5547:
-------------------------------------
    Description: 
In OM HA, the raftGroupID is generated from the OM service ID.
So if the OM service ID changes, OM startup fails with the error below:


{code:java}
2021-08-05 12:20:03,043 ERROR org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
java.io.IOException: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
        at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
        at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
        at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:71)
        at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:354)
        at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:371)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.start(OzoneManagerRatisServer.java:390)
        at org.apache.hadoop.ozone.om.OzoneManager.start(OzoneManager.java:1109)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:126)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:79)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:67)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:38)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
        at picocli.CommandLine.access$1100(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
        at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2152)
        at picocli.CommandLine.parseWithHandlers(CommandLine.java:2530)
        at picocli.CommandLine.parseWithHandler(CommandLine.java:2465)
        at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
        at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:51)
Caused by: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
        at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
        at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:121)
        at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:164)
        at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:268)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.initialize(OzoneManagerStateMachine.java:127)
        at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
        at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:193)
        at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$4(RaftServerProxy.java:266)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
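
To illustrate why a service ID change leads to a different Raft group, here is a
minimal sketch. It assumes the group ID is a name-based UUID computed from the
service ID string; the class and method names are hypothetical and this is not
the actual Ozone code.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper for illustration only; not the Ozone source.
final class ServiceIdGroupIdSketch {

  // Assumption: the Raft group ID is a name-based (MD5) UUID of the
  // OM service ID bytes, so it is a pure function of the service ID.
  static UUID groupIdFromServiceId(String omServiceId) {
    return UUID.nameUUIDFromBytes(omServiceId.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    UUID before = groupIdFromServiceId("omservice1");
    UUID after = groupIdFromServiceId("omservice2");

    // The same service ID always maps to the same group ID ...
    System.out.println("stable: " + before.equals(groupIdFromServiceId("omservice1")));
    // ... but a renamed service maps to a different group ID, which no
    // longer matches the group already stored on disk.
    System.out.println("same group after rename: " + before.equals(after));
  }
}
```

Under this assumption, the on-disk Ratis state still belongs to the old group
while the restarted OM computes a new group ID from the new service ID.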

One possible solution:
If a Ratis group directory already exists, use its group ID as is, because it
belongs to an existing cluster that we cannot change. For new clusters, we could
generate the group ID from the clusterID, which does not change for an Ozone
cluster. That way we would be tolerant of service ID config changes.

This is just one idea; we can discuss other approaches to solve this issue.

As of right now, OM does not allow the OM service ID to be changed.
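
The idea above could be sketched roughly as follows. This is only an
illustration under stated assumptions: it assumes each existing Raft group has
a subdirectory named after its group UUID under the Ratis storage directory,
and that a new cluster derives a name-based UUID from the clusterID. The class
GroupIdResolverSketch and its methods are hypothetical names, not Ozone or
Ratis APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.UUID;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Ozone or Ratis.
final class GroupIdResolverSketch {

  // Parses a directory name as a UUID, or returns empty if it is not one.
  static Optional<UUID> parseUuid(String name) {
    try {
      return Optional.of(UUID.fromString(name));
    } catch (IllegalArgumentException e) {
      return Optional.empty();
    }
  }

  // Assumption: an existing Raft group is stored in a subdirectory whose
  // name is the group UUID. Returns the first such UUID found, if any.
  static Optional<UUID> existingGroupId(Path ratisStorageDir) {
    if (!Files.isDirectory(ratisStorageDir)) {
      return Optional.empty();
    }
    try (Stream<Path> children = Files.list(ratisStorageDir)) {
      return children
          .filter(p -> Files.isDirectory(p))
          .map(p -> parseUuid(p.getFileName().toString()))
          .filter(Optional::isPresent)
          .map(Optional::get)
          .findFirst();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Existing cluster: reuse the on-disk group ID.
  // New cluster: derive the group ID from the clusterID, not the service ID.
  static UUID resolveGroupId(Path ratisStorageDir, String clusterId) {
    return existingGroupId(ratisStorageDir).orElseGet(
        () -> UUID.nameUUIDFromBytes(clusterId.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("ratis-sketch");
    System.out.println("new cluster group ID: " + resolveGroupId(dir, "CID-example"));

    // Simulate an existing cluster by creating a group directory.
    UUID existing = UUID.randomUUID();
    Files.createDirectory(dir.resolve(existing.toString()));
    System.out.println("existing group reused: "
        + resolveGroupId(dir, "CID-example").equals(existing));
  }
}
```

With this shape the service ID never enters the computation, so changing it in
the config would not affect the group ID in either case.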

> Generate raftgroupId should not depend on service id
> ----------------------------------------------------
>
>                 Key: HDDS-5547
>                 URL: https://issues.apache.org/jira/browse/HDDS-5547
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
