Siyao Meng created HDDS-8945:
--------------------------------

             Summary: Investigate potential Ratis issue or incorrect usage in 
OM integrationt test
                 Key: HDDS-8945
                 URL: https://issues.apache.org/jira/browse/HDDS-8945
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Siyao Meng


The integration test `TestOmSnapshotDisabledRestart` added in HDDS-8555 would 
occasionally pass, but most of the time it would fail at the OM restart step 
because om.stop() is stuck at Ratis retrying to quit, resulting in OM main 
RocksDB LOCK not being released when it is restarting, printing a ton of 
{{Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception}} WARM message before it times out:

{code}
2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO  impl.RoleInfo 
(RoleInfo.java:shutdownFollowerState(110)) - omNode-3: shutdown 
omNode-3@group-6E8B7B074D31-FollowerState
2023-06-27 20:36:27,174 [omNode-3@group-6E8B7B074D31-FollowerState] INFO  
impl.FollowerState (FollowerState.java:run(152)) - 
omNode-3@group-6E8B7B074D31-FollowerState was interrupted
2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO  impl.StateMachineUpdater 
(StateMachineUpdater.java:stopAndJoin(155)) - 
omNode-3@group-6E8B7B074D31-StateMachineUpdater: set stopIndex = 5
2023-06-27 20:36:27,175 [OM StateMachine ApplyTransaction Thread - 0] INFO  
snapshot.OMSnapshotCreateRequest 
(OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created snapshot: 
'snap-48567' with snapshotId: '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path 
'vol-47883/buck-56159'
2023-06-27 20:36:27,175 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
ratis.OzoneManagerStateMachine 
(OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index 
(t:1, i:4)
2023-06-27 20:36:27,178 [OM StateMachine ApplyTransaction Thread - 0] INFO  
snapshot.OMSnapshotCreateRequest 
(OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created snapshot: 
'snap-48567' with snapshotId: '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path 
'vol-47883/buck-56159'
2023-06-27 20:36:27,180 [grpc-default-executor-6] WARN  
server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3: 
APPEND_ENTRIES onError, lastRequest: null: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
cancelled
2023-06-27 20:36:27,180 [grpc-default-executor-2] WARN  
server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3: 
APPEND_ENTRIES onError, lastRequest: omNode-1->omNode-3#38-t1,previous=(t:1, 
i:5),leaderCommit=5,initializing? true,entries: size=1, first=(t:1, i:6), 
METADATAENTRY(c:5): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
CANCELLED: client cancelled
2023-06-27 20:36:27,180 [grpc-default-executor-5] WARN  
server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3: 
APPEND_ENTRIES onError, lastRequest: null: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
cancelled
2023-06-27 20:36:27,181 [main] INFO  server.GrpcService 
(GrpcService.java:closeImpl(320)) - omNode-3: shutdown server 
GrpcServerProtocolService successfully
2023-06-27 20:36:27,181 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
2023-06-27 20:36:27,181 [grpc-default-executor-2] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
2023-06-27 20:36:27,181 [grpc-default-executor-6] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
2023-06-27 20:36:27,182 [grpc-default-executor-5] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 7 -> 6
2023-06-27 20:36:27,182 [grpc-default-executor-2] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 6 -> 5
2023-06-27 20:36:27,182 [grpc-default-executor-6] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 5 -> 4
2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) - 
omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(97)) - 
omNode-3@group-6E8B7B074D31-StateMachineUpdater: snapshotIndex: 
updateIncreasingly -1 -> 4
2023-06-27 20:36:27,350 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
ratis.OzoneManagerStateMachine 
(OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index 
(t:1, i:4)
2023-06-27 20:36:27,365 [OMDoubleBufferFlushThread] INFO  
db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) - 
Created checkpoint in rocksDB at 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 in 194 milliseconds
2023-06-27 20:36:27,379 [OMDoubleBufferFlushThread] INFO  db.RDBCheckpointUtils 
(RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 13 
milliseconds for checkpoint directory 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 availability.
2023-06-27 20:36:27,381 [OMDoubleBufferFlushThread] INFO  om.OmSnapshotManager 
(OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) - Created checkpoint : 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 for snapshot snap-48567
2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO  
db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) - 
Created checkpoint in rocksDB at 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 in 198 milliseconds
2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO  db.RDBCheckpointUtils 
(RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0 
milliseconds for checkpoint directory 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 availability.
2023-06-27 20:36:27,393 [OMDoubleBufferFlushThread] INFO  om.OmSnapshotManager 
(OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) - Created checkpoint : 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 for snapshot snap-48567
2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO  
db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) - 
Created checkpoint in rocksDB at 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 in 206 milliseconds
2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO  db.RDBCheckpointUtils 
(RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0 
milliseconds for checkpoint directory 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 availability.
2023-06-27 20:36:27,397 [OMDoubleBufferFlushThread] INFO  om.OmSnapshotManager 
(OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) - Created checkpoint : 
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
 for snapshot snap-48567
2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) - 
omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
ratis.OzoneManagerStateMachine (OzoneManagerStateMachine.java:close(533)) - 
StateMachine has shutdown. Shutdown OzoneManager if not already shutdown.
2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO  
ratis.OzoneManagerDoubleBuffer (OzoneManagerDoubleBuffer.java:stopDaemon(551)) 
- Stopping OMDoubleBuffer flush thread
2023-06-27 20:36:27,400 [omNode-3-impl-thread1] INFO  
server.RaftServer$Division (ServerState.java:close(472)) - 
omNode-3@group-6E8B7B074D31: closes. applyIndex: 5
2023-06-27 20:36:27,680 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,681 [grpc-default-executor-2] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,681 [grpc-default-executor-5] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 4 -> 3
2023-06-27 20:36:27,681 [grpc-default-executor-2] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 3 -> 2
2023-06-27 20:36:27,683 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,683 [grpc-default-executor-2] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,684 [grpc-default-executor-5] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 7 -> 6
2023-06-27 20:36:27,684 [grpc-default-executor-2] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 6 -> 5
2023-06-27 20:36:27,690 [grpc-default-executor-2] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,690 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,690 [grpc-default-executor-2] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 7 -> 6
2023-06-27 20:36:27,690 [grpc-default-executor-5] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 6 -> 5
2023-06-27 20:36:27,930 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,930 [grpc-default-executor-2] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:27,932 [grpc-default-executor-5] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 5 -> 4
2023-06-27 20:36:27,932 [grpc-default-executor-2] INFO  leader.FollowerInfo 
(FollowerInfoImpl.java:lambda$new$0(52)) - 
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex: 
updateUnconditionally 4 -> 3
{code}

{code}
2023-06-27 20:36:57,356 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:57,356 [grpc-default-executor-6] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:57,607 [grpc-default-executor-6] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:57,607 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:57,861 [grpc-default-executor-6] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:57,861 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:58,116 [grpc-default-executor-5] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2023-06-27 20:36:58,116 [grpc-default-executor-6] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(122)) - 
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed 
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
{code}

Attached is a diff in MiniOzoneHA to help debug this issue. Apply the diff and 
run `TestOmSnapshotDisabledRestart#testSnapshotFeatureFlag` to repro.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to