[
https://issues.apache.org/jira/browse/HDDS-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siyao Meng updated HDDS-8945:
-----------------------------
Attachment: HDDS-8945.debug.001.diff
> Investigate potential Ratis issue or incorrect usage in OM integrationt test
> ----------------------------------------------------------------------------
>
> Key: HDDS-8945
> URL: https://issues.apache.org/jira/browse/HDDS-8945
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Siyao Meng
> Priority: Major
> Attachments: HDDS-8945.debug.001.diff
>
>
> The integration test `TestOmSnapshotDisabledRestart` added in HDDS-8555 would
> occasionally pass, but most of the time it would fail at the OM restart step
> because om.stop() is stuck at Ratis retrying to quit, resulting in OM main
> RocksDB LOCK not being released when it is restarting, printing a ton of
> {{Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception}} WARM message before it times out:
> {code}
> 2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO impl.RoleInfo
> (RoleInfo.java:shutdownFollowerState(110)) - omNode-3: shutdown
> omNode-3@group-6E8B7B074D31-FollowerState
> 2023-06-27 20:36:27,174 [omNode-3@group-6E8B7B074D31-FollowerState] INFO
> impl.FollowerState (FollowerState.java:run(152)) -
> omNode-3@group-6E8B7B074D31-FollowerState was interrupted
> 2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO
> impl.StateMachineUpdater (StateMachineUpdater.java:stopAndJoin(155)) -
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: set stopIndex = 5
> 2023-06-27 20:36:27,175 [OM StateMachine ApplyTransaction Thread - 0] INFO
> snapshot.OMSnapshotCreateRequest
> (OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created
> snapshot: 'snap-48567' with snapshotId:
> '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path 'vol-47883/buck-56159'
> 2023-06-27 20:36:27,175 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO ratis.OzoneManagerStateMachine
> (OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index
> (t:1, i:4)
> 2023-06-27 20:36:27,178 [OM StateMachine ApplyTransaction Thread - 0] INFO
> snapshot.OMSnapshotCreateRequest
> (OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created
> snapshot: 'snap-48567' with snapshotId:
> '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path 'vol-47883/buck-56159'
> 2023-06-27 20:36:27,180 [grpc-default-executor-6] WARN
> server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3:
> APPEND_ENTRIES onError, lastRequest: null:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
> cancelled
> 2023-06-27 20:36:27,180 [grpc-default-executor-2] WARN
> server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3:
> APPEND_ENTRIES onError, lastRequest: omNode-1->omNode-3#38-t1,previous=(t:1,
> i:5),leaderCommit=5,initializing? true,entries: size=1, first=(t:1, i:6),
> METADATAENTRY(c:5):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
> cancelled
> 2023-06-27 20:36:27,180 [grpc-default-executor-5] WARN
> server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3:
> APPEND_ENTRIES onError, lastRequest: null:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
> cancelled
> 2023-06-27 20:36:27,181 [main] INFO server.GrpcService
> (GrpcService.java:closeImpl(320)) - omNode-3: shutdown server
> GrpcServerProtocolService successfully
> 2023-06-27 20:36:27,181 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
> 2023-06-27 20:36:27,181 [grpc-default-executor-2] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
> 2023-06-27 20:36:27,181 [grpc-default-executor-6] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
> 2023-06-27 20:36:27,182 [grpc-default-executor-5] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 7 -> 6
> 2023-06-27 20:36:27,182 [grpc-default-executor-2] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 6 -> 5
> 2023-06-27 20:36:27,182 [grpc-default-executor-6] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 5 -> 4
> 2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) -
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
> 2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(97)) -
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: snapshotIndex:
> updateIncreasingly -1 -> 4
> 2023-06-27 20:36:27,350 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO ratis.OzoneManagerStateMachine
> (OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index
> (t:1, i:4)
> 2023-06-27 20:36:27,365 [OMDoubleBufferFlushThread] INFO
> db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) -
> Created checkpoint in rocksDB at
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> in 194 milliseconds
> 2023-06-27 20:36:27,379 [OMDoubleBufferFlushThread] INFO
> db.RDBCheckpointUtils
> (RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 13
> milliseconds for checkpoint directory
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> availability.
> 2023-06-27 20:36:27,381 [OMDoubleBufferFlushThread] INFO
> om.OmSnapshotManager (OmSnapshotManager.java:createOmSnapshotCheckpoint(468))
> - Created checkpoint :
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> for snapshot snap-48567
> 2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO
> db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) -
> Created checkpoint in rocksDB at
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> in 198 milliseconds
> 2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO
> db.RDBCheckpointUtils
> (RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0
> milliseconds for checkpoint directory
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> availability.
> 2023-06-27 20:36:27,393 [OMDoubleBufferFlushThread] INFO
> om.OmSnapshotManager (OmSnapshotManager.java:createOmSnapshotCheckpoint(468))
> - Created checkpoint :
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> for snapshot snap-48567
> 2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO
> db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) -
> Created checkpoint in rocksDB at
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> in 206 milliseconds
> 2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO
> db.RDBCheckpointUtils
> (RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0
> milliseconds for checkpoint directory
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> availability.
> 2023-06-27 20:36:27,397 [OMDoubleBufferFlushThread] INFO
> om.OmSnapshotManager (OmSnapshotManager.java:createOmSnapshotCheckpoint(468))
> - Created checkpoint :
> /Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
> for snapshot snap-48567
> 2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) -
> omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
> 2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO ratis.OzoneManagerStateMachine
> (OzoneManagerStateMachine.java:close(533)) - StateMachine has shutdown.
> Shutdown OzoneManager if not already shutdown.
> 2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater]
> INFO ratis.OzoneManagerDoubleBuffer
> (OzoneManagerDoubleBuffer.java:stopDaemon(551)) - Stopping OMDoubleBuffer
> flush thread
> 2023-06-27 20:36:27,400 [omNode-3-impl-thread1] INFO
> server.RaftServer$Division (ServerState.java:close(472)) -
> omNode-3@group-6E8B7B074D31: closes. applyIndex: 5
> 2023-06-27 20:36:27,680 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,681 [grpc-default-executor-2] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,681 [grpc-default-executor-5] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 4 -> 3
> 2023-06-27 20:36:27,681 [grpc-default-executor-2] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 3 -> 2
> 2023-06-27 20:36:27,683 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,683 [grpc-default-executor-2] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,684 [grpc-default-executor-5] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 7 -> 6
> 2023-06-27 20:36:27,684 [grpc-default-executor-2] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 6 -> 5
> 2023-06-27 20:36:27,690 [grpc-default-executor-2] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,690 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,690 [grpc-default-executor-2] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 7 -> 6
> 2023-06-27 20:36:27,690 [grpc-default-executor-5] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 6 -> 5
> 2023-06-27 20:36:27,930 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,930 [grpc-default-executor-2] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:27,932 [grpc-default-executor-5] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 5 -> 4
> 2023-06-27 20:36:27,932 [grpc-default-executor-2] INFO leader.FollowerInfo
> (FollowerInfoImpl.java:lambda$new$0(52)) -
> omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
> updateUnconditionally 4 -> 3
> {code}
> {code}
> 2023-06-27 20:36:57,356 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,356 [grpc-default-executor-6] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,607 [grpc-default-executor-6] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,607 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,861 [grpc-default-executor-6] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:57,861 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:58,116 [grpc-default-executor-5] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> 2023-06-27 20:36:58,116 [grpc-default-executor-6] WARN
> server.GrpcLogAppender (LogUtils.java:warn(122)) -
> omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
> appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> {code}
> Attached is a diff in MiniOzoneHA to help debug this issue. Apply the diff
> and run `TestOmSnapshotDisabledRestart#testSnapshotFeatureFlag` to repro.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]