[
https://issues.apache.org/jira/browse/RATIS-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duong updated RATIS-2054:
-------------------------
Description:
I believe RATIS-2045 introduced a regression. Many Ozone integration tests
fail after including this commit, probably because nodes can't be added to a
Ratis ring that has no log entries.
Sample run:
[https://github.com/duongkame/ozone/actions/runs/8623155740/job/23636444671]
Sample failed test: TestAddRemoveOzoneManager.testBootstrap
Behavior before the commit:
{code:java}
2024-04-10 17:31:36,897 [grpc-default-executor-2] INFO
ratis.OzoneManagerStateMachine
(OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received
Configuration change notification from Ratis. New Peer list:
[id: "omNode-1"
address: "localhost:15015"
startupRole: FOLLOWER
]
2024-04-10 17:31:36,905 [grpc-default-executor-2] INFO
ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) -
Added OM omNode-1 to Ratis Peers list.
2024-04-10 17:31:36,906 [grpc-default-executor-2] INFO om.OzoneManager
(OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer list.
2024-04-10 17:31:36,909 [grpc-default-executor-2] INFO
impl.SnapshotInstallationHandler
(SnapshotInstallationHandler.java:installSnapshot(103)) -
omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot:
omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
2024-04-10 17:31:36,922 [grpc-default-executor-2] INFO
server.GrpcServerProtocolService
(GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1:
Completed INSTALL_SNAPSHOT, lastRequest:
omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
2024-04-10 17:31:36,923 [grpc-default-executor-2] INFO
server.GrpcServerProtocolService
(GrpcServerProtocolService.java:lambda$onCompleted$7(202)) -
omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null
2024-04-10 17:31:36,924 [grpc-default-executor-0] INFO server.GrpcLogAppender
(GrpcLogAppender.java:onNext(658)) -
omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
received the first reply
omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
2024-04-10 17:31:36,929 [grpc-default-executor-0] INFO server.GrpcLogAppender
(GrpcLogAppender.java:onNext(679)) -
omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
Follower snapshot is already at index 0.
2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO leader.FollowerInfo
(FollowerInfoImpl.java:info(64)) -
omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: matchIndex: setUnconditionally
-1 -> 0
2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO leader.FollowerInfo
(FollowerInfoImpl.java:info(64)) -
omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: nextIndex: setUnconditionally
0 -> 1 {code}
Behavior after the commit:
{code:java}
2024-04-10 17:46:58,830 [grpc-default-executor-8] INFO
ratis.OzoneManagerStateMachine
(OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received
Configuration change notification from Ratis. New Peer list:
[id: "omNode-1"
address: "localhost:15015"
startupRole: FOLLOWER
]
2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO
ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) -
Added OM omNode-1 to Ratis Peers list.
2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO om.OzoneManager
(OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer list.
2024-04-10 17:46:58,847 [grpc-default-executor-8] INFO
impl.SnapshotInstallationHandler
(SnapshotInstallationHandler.java:installSnapshot(103)) -
omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot:
omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
2024-04-10 17:46:58,862 [omNode-bootstrap-1-InstallSnapshotThread] INFO
ratis_snapshot.OmRatisSnapshotProvider
(OmRatisSnapshotProvider.java:downloadSnapshot(146)) - Downloading latest
checkpoint from Leader OM omNode-1. Checkpoint URL:
http://127.0.0.1:15013/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
2024-04-10 17:46:58,870 [grpc-default-executor-8] INFO
server.GrpcServerProtocolService
(GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1:
Completed INSTALL_SNAPSHOT, lastRequest:
omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
2024-04-10 17:46:58,871 [grpc-default-executor-7] INFO server.GrpcLogAppender
(GrpcLogAppender.java:onNext(658)) -
omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
received the first reply
omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
2024-04-10 17:46:58,875 [grpc-default-executor-7] INFO server.GrpcLogAppender
(GrpcLogAppender.java:onNext(674)) -
omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
InstallSnapshot in progress.
2024-04-10 17:46:58,876 [grpc-default-executor-8] INFO
server.GrpcServerProtocolService
(GrpcServerProtocolService.java:lambda$onCompleted$7(202)) -
omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null {code}
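The difference between the two runs is visible in the installSnapshot reply: before the commit the reply is OK with ALREADY_INSTALLED and the matchIndex and nextIndex updates follow, while after the commit the reply is FAIL with IN_PROGRESS and a checkpoint download starts. To compare runs without reading the logs by hand, the reply strings can be parsed. The following is only a minimal sketch: InstallSnapshotReplyLine is a hypothetical helper, not part of Ratis or Ozone, and the pattern and the field names (recipient, sender) are assumptions based on the reply strings shown in the logs above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper; NOT part of Ratis or Ozone.
final class InstallSnapshotReplyLine {
  // Assumed shape, copied from the logs above:
  //   omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
  private static final Pattern REPLY = Pattern.compile(
      "(\\S+)<-(\\S+)#\\d+:(OK|FAIL)-t\\d+,([A-Z_]+),snapshotIndex=(-?\\d+)");

  final String recipient;   // left of "<-" (assumed to be the node receiving the reply)
  final String sender;      // right of "<-" (assumed to be the node that replied)
  final boolean ok;         // true for "OK", false for "FAIL"
  final String result;      // e.g. ALREADY_INSTALLED or IN_PROGRESS
  final long snapshotIndex;

  private InstallSnapshotReplyLine(String recipient, String sender, boolean ok,
                                   String result, long snapshotIndex) {
    this.recipient = recipient;
    this.sender = sender;
    this.ok = ok;
    this.result = result;
    this.snapshotIndex = snapshotIndex;
  }

  // Returns the first reply found in the line, or empty if the line has none.
  static Optional<InstallSnapshotReplyLine> parse(String logLine) {
    Matcher m = REPLY.matcher(logLine);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(new InstallSnapshotReplyLine(
        m.group(1), m.group(2), "OK".equals(m.group(3)),
        m.group(4), Long.parseLong(m.group(5))));
  }

  public static void main(String[] args) {
    String before = "omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0";
    String after = "omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0";
    for (String line : new String[] {before, after}) {
      parse(line).ifPresent(r -> System.out.println(
          r.sender + " -> " + r.recipient + ": ok=" + r.ok
              + ", result=" + r.result + ", snapshotIndex=" + r.snapshotIndex));
    }
  }
}
```

Running parse over each line of a test log and filtering on result makes it easy to spot bootstrap attempts that reply IN_PROGRESS at snapshotIndex=0.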
> Ozone integration test fails because of empty snapshot installation.
> --------------------------------------------------------------------
>
> Key: RATIS-2054
> URL: https://issues.apache.org/jira/browse/RATIS-2054
> Project: Ratis
> Issue Type: Improvement
> Reporter: Duong
> Priority: Major