[ 
https://issues.apache.org/jira/browse/RATIS-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794943#comment-17794943
 ] 

Attila Doroszlai commented on RATIS-1958:
-----------------------------------------

The OM service ID was changed between the two starts, from ozone1702065844 to 
ozone1.

{code}
2023-12-08 12:07:10,958 INFO [main]-org.apache.hadoop.ozone.om.ha.OMHANodeDetails: Found matching OM address with OMServiceId: ozone1702065844, OMNodeId: om27, RPC Address: ccycloud-1.weichiu.root.comops.site:9862 and Ratis port: 9872

...

2023-12-08 14:26:58,539 INFO [main]-org.apache.hadoop.ozone.om.ha.OMHANodeDetails: Found matching OM address with OMServiceId: ozone1, OMNodeId: om27, RPC Address: ccycloud-1.weichiu.root.comops.site:9862 and Ratis port: 9872
{code}

OM has an existing Raft dir (for group-786DE9B22639), but also creates a new 
Raft dir (for group-9F198C4C3682).  Hence the server hosts two groups, but 
there is only one state machine, which doesn't seem to handle multi-raft.
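
To illustrate why one state machine shared by two groups would produce the 
RUNNING -> STARTING error in the quoted report, here is a minimal sketch. It 
is a simplified model, not the Ratis LifeCycle class or the real 
OzoneManagerStateMachine: the class names, the three-state enum, and the 
initialization order are assumptions made for illustration only.

```java
import java.util.EnumSet;
import java.util.Set;

public class SharedStateMachineSketch {
    // Reduced set of lifecycle states; the real lifecycle has more.
    enum State { NEW, STARTING, RUNNING }

    // Hypothetical stand-in for a lifecycle-guarded state machine.
    static class StateMachine {
        private State state = State.NEW;

        State getState() {
            return state;
        }

        // Move to the target state only if the current state is an allowed predecessor.
        private void transition(State to, Set<State> allowedFrom) {
            if (!allowedFrom.contains(state)) {
                throw new IllegalStateException(
                    "ILLEGAL TRANSITION: " + state + " -> " + to);
            }
            state = to;
        }

        // Called once per Raft group that is handed this state machine instance.
        void initialize(String groupName) {
            transition(State.STARTING, EnumSet.of(State.NEW));
            // Loading of per-group state would happen here.
            transition(State.RUNNING, EnumSet.of(State.STARTING));
        }
    }

    public static void main(String[] args) {
        StateMachine shared = new StateMachine();
        // First group initializes the instance: NEW -> STARTING -> RUNNING.
        shared.initialize("group-786DE9B22639");
        try {
            // Second group reuses the same instance, which is already RUNNING.
            shared.initialize("group-9F198C4C3682");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the second initialize() call fails because STARTING is only 
reachable from NEW. Giving each group its own state machine instance would 
avoid the failure in the sketch.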

{code}
2023-12-08 14:27:14,514 INFO [om27-impl-thread1]-org.apache.ratis.server.RaftServer: om27: found a subdirectory /var/lib/hadoop-ozone/om/ratis/32905ab7-c4c1-3746-8bf3-786de9b22639
2023-12-08 14:27:14,517 INFO [om27-impl-thread1]-org.apache.ratis.server.RaftServer: om27: addNew group-786DE9B22639:[]...

2023-12-08 14:27:14,518 INFO [main]-org.apache.ratis.server.RaftServer: om27: addNew group-9F198C4C3682:...

2023-12-08 14:27:14,527 INFO [om27-groupManagement]-org.apache.ratis.server.RaftServer$Division: om27: new RaftServerImpl for group-786DE9B22639:...
2023-12-08 14:27:14,580 INFO [om27-groupManagement]-org.apache.ratis.server.RaftServer$Division: om27: new RaftServerImpl for group-9F198C4C3682:...

2023-12-08 14:27:15,216 INFO [om27-impl-thread2]-org.apache.ratis.server.storage.RaftStorageDirectory: The storage directory /var/lib/hadoop-ozone/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682 does not exist. Creating ...
{code}
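
Both directory names above are version-3 (name-based) UUIDs, and the short 
group names in the log are the last 12 hex digits of the directory name in 
upper case. That would be consistent with the group ID being derived from the 
service ID, so a changed service ID yields a different group and directory. 
The sketch below assumes the derivation is UUID.nameUUIDFromBytes over the 
UTF-8 bytes of the service ID; that is an assumption, not something these logs 
confirm, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.UUID;

public class GroupIdSketch {
    // Assumption: the group UUID is a name-based (version 3) UUID of the service ID.
    static UUID groupUuidFor(String serviceId) {
        return UUID.nameUUIDFromBytes(serviceId.getBytes(StandardCharsets.UTF_8));
    }

    // Short form as seen in the log: "group-" plus the last 12 hex digits, upper-cased.
    static String shortName(UUID uuid) {
        String s = uuid.toString();
        return "group-" + s.substring(s.length() - 12).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The two service IDs that appear in the OMHANodeDetails log lines.
        for (String serviceId : new String[] {"ozone1702065844", "ozone1"}) {
            UUID uuid = groupUuidFor(serviceId);
            System.out.println(serviceId + " -> " + uuid + " (" + shortName(uuid) + ")");
        }
    }
}
```

Comparing the printed UUIDs with the two directory names in the log would 
confirm or rule out the assumed derivation.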

> ILLEGAL TRANSITION RUNNING -> STARTING
> --------------------------------------
>
>                 Key: RATIS-1958
>                 URL: https://issues.apache.org/jira/browse/RATIS-1958
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: om_illegal_transition.tgz, ozone-om.log
>
>
> I saw this error on a new Ozone cluster (Cloudera CDP 7.1.9): OM crashed and 
> was unable to restart. Notably, I've seen this error twice in a week, on 
> separate clusters.
>  
> {code:java}
> 2023-12-08 14:27:15,265 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om27:group-9F198C4C3682, RUNNING -> STARTING
>         at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1298)
>         at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1284)
>         at java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1034)
>         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>         at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>         at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:189)
>         at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om27:group-9F198C4C3682, RUNNING -> STARTING
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:121)
>         at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:164)
>         at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:268)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.initialize(OzoneManagerStateMachine.java:140)
>         at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:173)
>         at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
>         at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
>         ... 4 more
> {code}


