Siyao Meng created HDDS-8945:
--------------------------------
Summary: Investigate potential Ratis issue or incorrect usage in
OM integrationt test
Key: HDDS-8945
URL: https://issues.apache.org/jira/browse/HDDS-8945
Project: Apache Ozone
Issue Type: Bug
Reporter: Siyao Meng
The integration test `TestOmSnapshotDisabledRestart` added in HDDS-8555 would
occasionally pass, but most of the time it would fail at the OM restart step
because om.stop() is stuck at Ratis retrying to quit, resulting in OM main
RocksDB LOCK not being released when it is restarting, printing a ton of
{{Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception}} WARM message before it times out:
{code}
2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO impl.RoleInfo
(RoleInfo.java:shutdownFollowerState(110)) - omNode-3: shutdown
omNode-3@group-6E8B7B074D31-FollowerState
2023-06-27 20:36:27,174 [omNode-3@group-6E8B7B074D31-FollowerState] INFO
impl.FollowerState (FollowerState.java:run(152)) -
omNode-3@group-6E8B7B074D31-FollowerState was interrupted
2023-06-27 20:36:27,174 [omNode-3-impl-thread1] INFO impl.StateMachineUpdater
(StateMachineUpdater.java:stopAndJoin(155)) -
omNode-3@group-6E8B7B074D31-StateMachineUpdater: set stopIndex = 5
2023-06-27 20:36:27,175 [OM StateMachine ApplyTransaction Thread - 0] INFO
snapshot.OMSnapshotCreateRequest
(OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created snapshot:
'snap-48567' with snapshotId: '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path
'vol-47883/buck-56159'
2023-06-27 20:36:27,175 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
ratis.OzoneManagerStateMachine
(OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index
(t:1, i:4)
2023-06-27 20:36:27,178 [OM StateMachine ApplyTransaction Thread - 0] INFO
snapshot.OMSnapshotCreateRequest
(OMSnapshotCreateRequest.java:validateAndUpdateCache(193)) - Created snapshot:
'snap-48567' with snapshotId: '4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa' under path
'vol-47883/buck-56159'
2023-06-27 20:36:27,180 [grpc-default-executor-6] WARN
server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3:
APPEND_ENTRIES onError, lastRequest: null:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2023-06-27 20:36:27,180 [grpc-default-executor-2] WARN
server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3:
APPEND_ENTRIES onError, lastRequest: omNode-1->omNode-3#38-t1,previous=(t:1,
i:5),leaderCommit=5,initializing? true,entries: size=1, first=(t:1, i:6),
METADATAENTRY(c:5): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
CANCELLED: client cancelled
2023-06-27 20:36:27,180 [grpc-default-executor-5] WARN
server.GrpcServerProtocolService (LogUtils.java:warn(122)) - omNode-3:
APPEND_ENTRIES onError, lastRequest: null:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2023-06-27 20:36:27,181 [main] INFO server.GrpcService
(GrpcService.java:closeImpl(320)) - omNode-3: shutdown server
GrpcServerProtocolService successfully
2023-06-27 20:36:27,181 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
2023-06-27 20:36:27,181 [grpc-default-executor-2] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
2023-06-27 20:36:27,181 [grpc-default-executor-6] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL
2023-06-27 20:36:27,182 [grpc-default-executor-5] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 7 -> 6
2023-06-27 20:36:27,182 [grpc-default-executor-2] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 6 -> 5
2023-06-27 20:36:27,182 [grpc-default-executor-6] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 5 -> 4
2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) -
omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
2023-06-27 20:36:27,349 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(97)) -
omNode-3@group-6E8B7B074D31-StateMachineUpdater: snapshotIndex:
updateIncreasingly -1 -> 4
2023-06-27 20:36:27,350 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
ratis.OzoneManagerStateMachine
(OzoneManagerStateMachine.java:takeSnapshot(479)) - Current Snapshot Index
(t:1, i:4)
2023-06-27 20:36:27,365 [OMDoubleBufferFlushThread] INFO
db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) -
Created checkpoint in rocksDB at
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
in 194 milliseconds
2023-06-27 20:36:27,379 [OMDoubleBufferFlushThread] INFO db.RDBCheckpointUtils
(RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 13
milliseconds for checkpoint directory
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
availability.
2023-06-27 20:36:27,381 [OMDoubleBufferFlushThread] INFO om.OmSnapshotManager
(OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) - Created checkpoint :
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-1/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
for snapshot snap-48567
2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO
db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) -
Created checkpoint in rocksDB at
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
in 198 milliseconds
2023-06-27 20:36:27,392 [OMDoubleBufferFlushThread] INFO db.RDBCheckpointUtils
(RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0
milliseconds for checkpoint directory
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
availability.
2023-06-27 20:36:27,393 [OMDoubleBufferFlushThread] INFO om.OmSnapshotManager
(OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) - Created checkpoint :
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-2/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
for snapshot snap-48567
2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO
db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(89)) -
Created checkpoint in rocksDB at
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
in 206 milliseconds
2023-06-27 20:36:27,396 [OMDoubleBufferFlushThread] INFO db.RDBCheckpointUtils
(RDBCheckpointUtils.java:waitForCheckpointDirectoryExist(59)) - Waited for 0
milliseconds for checkpoint directory
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
availability.
2023-06-27 20:36:27,397 [OMDoubleBufferFlushThread] INFO om.OmSnapshotManager
(OmSnapshotManager.java:createOmSnapshotCheckpoint(468)) - Created checkpoint :
/Users/smeng/repo/ozone-master/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-5ed4ff4b-1eeb-48c3-bfe1-e68bac0a1195/omNode-3/db.snapshots/checkpointState/om.db-4714bd6c-6f41-4d7e-8a8e-f31f8cb18aaa
for snapshot snap-48567
2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(291)) -
omNode-3@group-6E8B7B074D31-StateMachineUpdater: Took a snapshot at index 4
2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
ratis.OzoneManagerStateMachine (OzoneManagerStateMachine.java:close(533)) -
StateMachine has shutdown. Shutdown OzoneManager if not already shutdown.
2023-06-27 20:36:27,399 [omNode-3@group-6E8B7B074D31-StateMachineUpdater] INFO
ratis.OzoneManagerDoubleBuffer (OzoneManagerDoubleBuffer.java:stopDaemon(551))
- Stopping OMDoubleBuffer flush thread
2023-06-27 20:36:27,400 [omNode-3-impl-thread1] INFO
server.RaftServer$Division (ServerState.java:close(472)) -
omNode-3@group-6E8B7B074D31: closes. applyIndex: 5
2023-06-27 20:36:27,680 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,681 [grpc-default-executor-2] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,681 [grpc-default-executor-5] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 4 -> 3
2023-06-27 20:36:27,681 [grpc-default-executor-2] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 3 -> 2
2023-06-27 20:36:27,683 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,683 [grpc-default-executor-2] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,684 [grpc-default-executor-5] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 7 -> 6
2023-06-27 20:36:27,684 [grpc-default-executor-2] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 6 -> 5
2023-06-27 20:36:27,690 [grpc-default-executor-2] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,690 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,690 [grpc-default-executor-2] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 7 -> 6
2023-06-27 20:36:27,690 [grpc-default-executor-5] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 6 -> 5
2023-06-27 20:36:27,930 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,930 [grpc-default-executor-2] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:27,932 [grpc-default-executor-5] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 5 -> 4
2023-06-27 20:36:27,932 [grpc-default-executor-2] INFO leader.FollowerInfo
(FollowerInfoImpl.java:lambda$new$0(52)) -
omNode-1@group-6E8B7B074D31->omNode-3: decreaseNextIndex nextIndex:
updateUnconditionally 4 -> 3
{code}
{code}
2023-06-27 20:36:57,356 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:57,356 [grpc-default-executor-6] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:57,607 [grpc-default-executor-6] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:57,607 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:57,861 [grpc-default-executor-6] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:57,861 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:58,116 [grpc-default-executor-5] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
2023-06-27 20:36:58,116 [grpc-default-executor-6] WARN server.GrpcLogAppender
(LogUtils.java:warn(122)) -
omNode-1@group-6E8B7B074D31->omNode-3-AppendLogResponseHandler: Failed
appendEntries: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
{code}
Attached is a diff in MiniOzoneHA to help debug this issue. Apply the diff and
run `TestOmSnapshotDisabledRestart#testSnapshotFeatureFlag` to repro.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]