[
https://issues.apache.org/jira/browse/HDDS-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847435#comment-17847435
]
Duong commented on HDDS-10750:
------------------------------
[~szetszwo] [~adoroszlai]
The GroupMismatchException is a curious one. The logs for the affected
group show that the group is shut down and its storage directory removed
successfully just before the GroupMismatchException is thrown.
{code:java}
2024-05-03 11:32:03,720 [3ff8772d-f587-4fdc-a76a-606c19c41894-PipelineCommandHandlerThread-0] INFO server.RaftServer$Division (RaftServerImpl.java:lambda$close$1(501)) - 3ff8772d-f587-4fdc-a76a-606c19c41894@group-9B8676AF3B7E: shutdown
2024-05-03 11:32:03,720 [3ff8772d-f587-4fdc-a76a-606c19c41894-PipelineCommandHandlerThread-0] INFO impl.RoleInfo (RoleInfo.java:shutdownFollowerState(119)) - 3ff8772d-f587-4fdc-a76a-606c19c41894: shutdown 3ff8772d-f587-4fdc-a76a-606c19c41894@group-9B8676AF3B7E-FollowerState
2024-05-03 11:32:03,720 [3ff8772d-f587-4fdc-a76a-606c19c41894@group-9B8676AF3B7E-StateMachineUpdater] INFO impl.StateMachineUpdater (StateMachineUpdater.java:stop(141)) - 3ff8772d-f587-4fdc-a76a-606c19c41894@group-9B8676AF3B7E-StateMachineUpdater: closing ContainerStateMachine, lastApplied=(t:0, i:~)
2024-05-03 11:32:03,739 [3ff8772d-f587-4fdc-a76a-606c19c41894-PipelineCommandHandlerThread-0] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 3ff8772d-f587-4fdc-a76a-606c19c41894@group-9B8676AF3B7E-SegmentedRaftLogWorker close()
2024-05-03 11:32:03,743 [3ff8772d-f587-4fdc-a76a-606c19c41894-PipelineCommandHandlerThread-0] INFO server.RaftServer$Division (RaftServerImpl.java:groupRemove(471)) - 3ff8772d-f587-4fdc-a76a-606c19c41894@group-9B8676AF3B7E: Succeed to remove RaftStorageDirectory Storage Directory /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-340074cc-bece-47cc-8e54-63befbabe040/ozone-meta/datanode-6/ratis/f0772247-ee5b-45db-b730-9b8676af3b7e
2024-05-03 11:32:03,743 [3ff8772d-f587-4fdc-a76a-606c19c41894-PipelineCommandHandlerThread-0] ERROR commandhandler.ClosePipelineCommandHandler (ClosePipelineCommandHandler.java:lambda$handle$2(137)) - Can't close pipeline PipelineID=f0772247-ee5b-45db-b730-9b8676af3b7e
org.apache.ratis.protocol.exceptions.GroupMismatchException: 3ff8772d-f587-4fdc-a76a-606c19c41894: group-9B8676AF3B7E not found.
    at org.apache.ratis.server.impl.RaftServerProxy$ImplMap.get(RaftServerProxy.java:155)
    at org.apache.ratis.server.impl.RaftServerProxy.getImplFuture(RaftServerProxy.java:365)
    at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:374)
    at org.apache.ratis.server.impl.RaftServerProxy.getDivision(RaftServerProxy.java:387)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.getRaftPeersInPipeline(XceiverServerRatis.java:951)
    at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.lambda$handle$2(ClosePipelineCommandHandler.java:114)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
This seems to come from a recent commit in Ozone (HDDS-9959), in
[ClosePipelineCommandHandler#handle|https://github.com/duongkame/ozone/blob/fef3f5d07a846539887d006c055eb8dd152ea348/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/ClosePipelineCommandHandler.java#L108-L108].
The pipeline's group is removed first. ratisServer.getRaftPeersInPipeline()
is called afterwards, and it sometimes throws the GroupMismatchException
because the group no longer exists on the server.
{code:java}
server.removeGroup(pipelineIdProto);
if (server instanceof XceiverServerRatis) {
  // TODO: Refactor Ratis logic to XceiverServerRatis
  // Propagate the group remove to the other Raft peers in the pipeline
  XceiverServerRatis ratisServer = (XceiverServerRatis) server;
  final RaftGroupId raftGroupId = RaftGroupId.valueOf(pipelineID.getId());
  // this causes the GroupMismatchException
  final Collection<RaftPeer> peers =
      ratisServer.getRaftPeersInPipeline(pipelineID);
{code}
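To illustrate the ordering problem, here is a minimal sketch, not the actual Ozone or Ratis code. FakeServer, GroupNotFoundException, removeThenLookup and lookupThenRemove are hypothetical names that stand in for the server's group map, the GroupMismatchException, and the two possible call orders. The sketch is deterministic and does not model why the real failure is only intermittent; it only shows that looking up the peers before removing the group avoids the failed lookup.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClosePipelineOrderSketch {

  /** Hypothetical stand-in for GroupMismatchException. */
  static class GroupNotFoundException extends RuntimeException {
    GroupNotFoundException(String groupId) {
      super(groupId + " not found.");
    }
  }

  /** Hypothetical stand-in for a server that tracks peers per group. */
  static class FakeServer {
    private final Map<String, List<String>> groups = new ConcurrentHashMap<>();

    void addGroup(String groupId, List<String> peers) {
      groups.put(groupId, peers);
    }

    void removeGroup(String groupId) {
      groups.remove(groupId);
    }

    Collection<String> getPeers(String groupId) {
      List<String> peers = groups.get(groupId);
      if (peers == null) {
        // Mirrors the "group ... not found" failure seen in the log.
        throw new GroupNotFoundException(groupId);
      }
      return peers;
    }
  }

  /** Same order as the excerpt: remove the group, then look up its peers. */
  static Collection<String> removeThenLookup(FakeServer server, String groupId) {
    server.removeGroup(groupId);
    return server.getPeers(groupId); // the group is already gone
  }

  /** Alternative order: capture the peers first, then remove the group. */
  static Collection<String> lookupThenRemove(FakeServer server, String groupId) {
    Collection<String> peers = server.getPeers(groupId);
    server.removeGroup(groupId);
    return peers;
  }

  public static void main(String[] args) {
    FakeServer server = new FakeServer();

    server.addGroup("group-A", Arrays.asList("dn1", "dn2", "dn3"));
    System.out.println("lookup then remove: " + lookupThenRemove(server, "group-A"));

    server.addGroup("group-B", Arrays.asList("dn1", "dn2", "dn3"));
    try {
      removeThenLookup(server, "group-B");
    } catch (GroupNotFoundException e) {
      System.out.println("remove then lookup failed: " + e.getMessage());
    }
  }
}
```

Whether reordering the calls is the right fix for ClosePipelineCommandHandler is a separate question; the peers could also be obtained from another source that does not depend on the group still being registered.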
> Intermittent fork timeout while stopping Ratis server
> -----------------------------------------------------
>
> Key: HDDS-10750
> URL: https://issues.apache.org/jira/browse/HDDS-10750
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Attila Doroszlai
> Priority: Critical
> Attachments: 2024-04-21T16-53-06_683-jvmRun1.dump,
> 2024-05-03T11-31-12_561-jvmRun1.dump,
> org.apache.hadoop.fs.ozone.TestOzoneFileChecksum-output.txt,
> org.apache.hadoop.hdds.scm.TestSCMInstallSnapshot-output.txt,
> org.apache.hadoop.ozone.client.rpc.TestECKeyOutputStreamWithZeroCopy-output.txt,
> org.apache.hadoop.ozone.container.TestECContainerRecovery-output.txt,
> org.apache.hadoop.ozone.om.TestOzoneManagerPrepare-output.txt
>
>
> {code:title=https://github.com/adoroszlai/ozone-build-results/blob/master/2024/04/21/30803/it-client/output.log}
> [INFO] Running org.apache.hadoop.ozone.client.rpc.TestECKeyOutputStreamWithZeroCopy
> [INFO]
> [INFO] Results:
> ...
> ... There was a timeout or other error in the fork
> {code}
> {code}
> "main"
> java.lang.Thread.State: WAITING
> at java.lang.Object.wait(Native Method)
> at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
> ...
> at org.apache.hadoop.ozone.MiniOzoneClusterImpl.stopDatanodes(MiniOzoneClusterImpl.java:473)
> at org.apache.hadoop.ozone.MiniOzoneClusterImpl.stop(MiniOzoneClusterImpl.java:414)
> at org.apache.hadoop.ozone.MiniOzoneClusterImpl.shutdown(MiniOzoneClusterImpl.java:400)
> at org.apache.hadoop.ozone.client.rpc.AbstractTestECKeyOutputStream.shutdown(AbstractTestECKeyOutputStream.java:160)
> "ForkJoinPool.commonPool-worker-7"
> java.lang.Thread.State: TIMED_WAITING
> ...
> at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
> at org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:144)
> at org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:136)
> at org.apache.ratis.server.impl.RaftServerProxy.lambda$close$9(RaftServerProxy.java:438)
> ...
> at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:304)
> at org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
> at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.stop(XceiverServerRatis.java:603)
> at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.stop(OzoneContainer.java:484)
> at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.close(DatanodeStateMachine.java:447)
> at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.stopDaemon(DatanodeStateMachine.java:637)
> at org.apache.hadoop.ozone.HddsDatanodeService.stop(HddsDatanodeService.java:550)
> at org.apache.hadoop.ozone.MiniOzoneClusterImpl.stopDatanode(MiniOzoneClusterImpl.java:479)
> at org.apache.hadoop.ozone.MiniOzoneClusterImpl$$Lambda$2077/645273703.accept(Unknown Source)
> "c7edee5d-bf3c-45a7-a783-e11562f208dc-impl-thread2"
> java.lang.Thread.State: WAITING
> ...
> at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> at org.apache.ratis.server.impl.RaftServerImpl.lambda$close$3(RaftServerImpl.java:543)
> at org.apache.ratis.server.impl.RaftServerImpl$$Lambda$1925/263251010.run(Unknown Source)
> at org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$7(LifeCycle.java:306)
> at org.apache.ratis.util.LifeCycle$$Lambda$1204/655954062.get(Unknown Source)
> at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:326)
> at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:304)
> at org.apache.ratis.server.impl.RaftServerImpl.close(RaftServerImpl.java:525)
> {code}