[ 
https://issues.apache.org/jira/browse/HDDS-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-8945:
-----------------------------
    Attachment: HDDS-8945.debug.001.diff

> Investigate potential Ratis issue or incorrect usage in OM integrationt test
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-8945
>                 URL: https://issues.apache.org/jira/browse/HDDS-8945
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Siyao Meng
>            Priority: Major
>         Attachments: HDDS-8945.debug.001.diff
>
>
> The integration test `TestOmSnapshotDisabledRestart` added in HDDS-8555 would 
> occasionally pass, but most of the time it would fail at the OM restart step 
> because om.stop() is stuck at Ratis retrying to quit, resulting in OM main 
> RocksDB LOCK not being released when it is restarting, printing a ton of 
> {{Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception}} WARM message before it times out:
> {code}
> 2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO  impl.RoleInfo 
> (RoleInfo.java:shutdownFollowerState(110)) - omNode-3: shutdown 
> omNode-3@group-6E8B7B074D31-FollowerState
> 2023-06-27 20:36:27,174 [omNode-3@group-6E8B7B074D31-FollowerState] INFO  
> impl.FollowerState (FollowerState.java:run(152)) - 
> omNode-3@group-6E8B7B074D31-FollowerState was interrupted
> 2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO  
> impl.StateMachineUpdater (StateMachineUpdater.java:stopAndJoin(155)) - 
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: set stopIndex = 5
> 2023-06-27 20:36:27,175 [OM StateMachine ApplyTransaction Thread - 0] INFO  
> snapshot.OMSnapshotCreateRequest 
> (OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created 
> snapshot: 'snap-48567' with snapshotId: 
> '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path 'vol-47883/buck-56159'
> 2023-06-27 20:36:27,175 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index 
> (t:1, i:4)
> 2023-06-27 20:36:27,178 [OM StateMachine ApplyTransaction Thread - 0] INFO  
> snapshot.OMSnapshotCreateRequest 
> (OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created 
> snapshot: 'snap-48567' with snapshotId: 
> '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path 'vol-47883/buck-56159'
> 2023-06-27 20:36:27,180 [grpc-default-executor-6] WARN  
> server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3: 
> APPEND_ENTRIES onError, lastRequest: null: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
> cancelled
> 2023-06-27 20:36:27,180 [grpc-default-executor-2] WARN  
> server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3: 
> APPEND_ENTRIES onError, lastRequest: omNode-1->omNode-3#38-t1,previous=(t:1, 
> i:5),leaderCommit=5,initializing? true,entries: size=1, first=(t:1, i:6), 
> METADATAENTRY(c:5): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
> cancelled
> 2023-06-27 20:36:27,180 [grpc-default-executor-5] WARN  
> server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3: 
> APPEND_ENTRIES onError, lastRequest: null: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
> cancelled
> 2023-06-27 20:36:27,181 [main] INFO  server.GrpcService 
> (GrpcService.java:closeImpl(320)) - omNode-3: shutdown server 
> GrpcServerProtocolService successfully
> 2023-06-27 20:36:27,181 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
> 2023-06-27 20:36:27,181 [grpc-default-executor-2] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
> 2023-06-27 20:36:27,181 [grpc-default-executor-6] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
> 2023-06-27 20:36:27,182 [grpc-default-executor-5] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 7 -> 6
> 2023-06-27 20:36:27,182 [grpc-default-executor-2] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 6 -> 5
> 2023-06-27 20:36:27,182 [grpc-default-executor-6] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 5 -> 4
> 2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) - 
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
> 2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(97)) - 
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: snapshotIndex: 
> updateIncreasingly -1 -> 4
> 2023-06-27 20:36:27,350 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index 
> (t:1, i:4)
> 2023-06-27 20:36:27,365 [OMDoubleBufferFlushThread] INFO  
> db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) - 
> Created checkpoint in rocksDB at 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  in 194 milliseconds
> 2023-06-27 20:36:27,379 [OMDoubleBufferFlushThread] INFO  
> db.RDBCheckpointUtils 
> (RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 13 
> milliseconds for checkpoint directory 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  availability.
> 2023-06-27 20:36:27,381 [OMDoubleBufferFlushThread] INFO  
> om.OmSnapshotManager (OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) 
> - Created checkpoint : 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  for snapshot snap-48567
> 2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO  
> db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) - 
> Created checkpoint in rocksDB at 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  in 198 milliseconds
> 2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO  
> db.RDBCheckpointUtils 
> (RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0 
> milliseconds for checkpoint directory 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  availability.
> 2023-06-27 20:36:27,393 [OMDoubleBufferFlushThread] INFO  
> om.OmSnapshotManager (OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) 
> - Created checkpoint : 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  for snapshot snap-48567
> 2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO  
> db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) - 
> Created checkpoint in rocksDB at 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  in 206 milliseconds
> 2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO  
> db.RDBCheckpointUtils 
> (RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0 
> milliseconds for checkpoint directory 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  availability.
> 2023-06-27 20:36:27,397 [OMDoubleBufferFlushThread] INFO  
> om.OmSnapshotManager (OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) 
> - Created checkpoint : 
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
>  for snapshot snap-48567
> 2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) - 
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
> 2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:close(533)) - StateMachine has shutdown. 
> Shutdown OzoneManager if not already shutdown.
> 2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] 
> INFO  ratis.OzoneManagerDoubleBuffer 
> (OzoneManagerDoubleBuffer.java:stopDaemon(551)) - Stopping OMDoubleBuffer 
> flush thread
> 2023-06-27 20:36:27,400 [omNode-3-impl-thread1] INFO  
> server.RaftServer$Division (ServerState.java:close(472)) - 
> omNode-3@group-6E8B7B074D31: closes. applyIndex: 5
> 2023-06-27 20:36:27,680 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,681 [grpc-default-executor-2] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,681 [grpc-default-executor-5] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 4 -> 3
> 2023-06-27 20:36:27,681 [grpc-default-executor-2] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 3 -> 2
> 2023-06-27 20:36:27,683 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,683 [grpc-default-executor-2] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,684 [grpc-default-executor-5] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 7 -> 6
> 2023-06-27 20:36:27,684 [grpc-default-executor-2] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 6 -> 5
> 2023-06-27 20:36:27,690 [grpc-default-executor-2] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,690 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,690 [grpc-default-executor-2] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 7 -> 6
> 2023-06-27 20:36:27,690 [grpc-default-executor-5] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 6 -> 5
> 2023-06-27 20:36:27,930 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,930 [grpc-default-executor-2] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,932 [grpc-default-executor-5] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 5 -> 4
> 2023-06-27 20:36:27,932 [grpc-default-executor-2] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:lambda$new$0(52)) - 
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
> updateUnconditionally 4 -> 3
> {code}
> {code}
> 2023-06-27 20:36:57,356 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,356 [grpc-default-executor-6] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,607 [grpc-default-executor-6] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,607 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,861 [grpc-default-executor-6] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,861 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:58,116 [grpc-default-executor-5] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> 2023-06-27 20:36:58,116 [grpc-default-executor-6] WARN  
> server.GrpcLogAppender (LogUtils.java:warn(122)) - 
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> {code}
> Attached is a diff in MiniOzoneHA to help debug this issue. Apply the diff 
> and run `TestOmSnapshotDisabledRestart#testSnapshotFeatureFlag` to repro.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to