[ 
https://issues.apache.org/jira/browse/RATIS-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2054:
------------------------------
    Component/s: snapshot
     Issue Type: Bug  (was: Improvement)

> Ozone integration test fails because of empty snapshot installation.
> --------------------------------------------------------------------
>
>                 Key: RATIS-2054
>                 URL: https://issues.apache.org/jira/browse/RATIS-2054
>             Project: Ratis
>          Issue Type: Bug
>          Components: snapshot
>    Affects Versions: 3.1.0
>            Reporter: Duong
>            Priority: Major
>
> I believe RATIS-2045 introduced a regression. Many Ozone integration 
> tests fail once this commit is included, probably because nodes can no 
> longer be added to a Ratis ring that has no log entries.
> Sample run: 
> https://github.com/duongkame/ozone/actions/runs/8623155740/job/23636444671
> Sample failed test: TestAddRemoveOzoneManager.testBootstrap
>  
> Before the commit, it seems that new nodes did not have to install snapshots 
> (the installSnapshot check returned ALREADY_INSTALLED).
> {code:java}
> 2024-04-10 17:31:36,897 [grpc-default-executor-2] INFO  
> ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received 
> Configuration change notification from Ratis. New Peer list:
> [id: "omNode-1"
> address: "localhost:15015"
> startupRole: FOLLOWER
> ]
> 2024-04-10 17:31:36,905 [grpc-default-executor-2] INFO  
> ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) 
> - Added OM omNode-1 to Ratis Peers list.
> 2024-04-10 17:31:36,906 [grpc-default-executor-2] INFO  om.OzoneManager 
> (OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer 
> list.
> 2024-04-10 17:31:36,909 [grpc-default-executor-2] INFO  
> impl.SnapshotInstallationHandler 
> (SnapshotInstallationHandler.java:installSnapshot(103)) - 
> omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot: 
> omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
> 2024-04-10 17:31:36,922 [grpc-default-executor-2] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1: 
> Completed INSTALL_SNAPSHOT, lastRequest: 
> omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
> 2024-04-10 17:31:36,923 [grpc-default-executor-2] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:lambda$onCompleted$7(202)) - 
> omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null
> 2024-04-10 17:31:36,924 [grpc-default-executor-0] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(658)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  received the first reply 
> omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
> 2024-04-10 17:31:36,929 [grpc-default-executor-0] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(679)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  Follower snapshot is already at index 0.
> 2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:info(64)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: matchIndex: 
> setUnconditionally -1 -> 0
> 2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:info(64)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: nextIndex: 
> setUnconditionally 0 -> 1 {code}
> After the commit, a new node seems to have to install an (empty) snapshot.
> {code:java}
> 2024-04-10 17:46:58,830 [grpc-default-executor-8] INFO  
> ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received 
> Configuration change notification from Ratis. New Peer list:
> [id: "omNode-1"
> address: "localhost:15015"
> startupRole: FOLLOWER
> ]
> 2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO  
> ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) 
> - Added OM omNode-1 to Ratis Peers list.
> 2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO  om.OzoneManager 
> (OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer 
> list.
> 2024-04-10 17:46:58,847 [grpc-default-executor-8] INFO  
> impl.SnapshotInstallationHandler 
> (SnapshotInstallationHandler.java:installSnapshot(103)) - 
> omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot: 
> omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
> 2024-04-10 17:46:58,862 [omNode-bootstrap-1-InstallSnapshotThread] INFO  
> ratis_snapshot.OmRatisSnapshotProvider 
> (OmRatisSnapshotProvider.java:downloadSnapshot(146)) - Downloading latest 
> checkpoint from Leader OM omNode-1. Checkpoint URL: 
> http://127.0.0.1:15013/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
> 2024-04-10 17:46:58,870 [grpc-default-executor-8] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1: 
> Completed INSTALL_SNAPSHOT, lastRequest: 
> omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
> 2024-04-10 17:46:58,871 [grpc-default-executor-7] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(658)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  received the first reply 
> omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
> 2024-04-10 17:46:58,875 [grpc-default-executor-7] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(674)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  InstallSnapshot in progress.
> 2024-04-10 17:46:58,876 [grpc-default-executor-8] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:lambda$onCompleted$7(202)) - 
> omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null {code}
> This snapshot installation seems to always fail because no checkpoint 
> is created (perhaps because there are no log entries?).
> {code:java}
> 2024-04-10 18:13:10,601 [qtp1588976146-447] ERROR utils.DBCheckpointServlet 
> (DBCheckpointServlet.java:generateSnapshotCheckpoint(238)) - Unable to 
> process metadata snapshot request.
> java.nio.file.NoSuchFileException: 
> /Users/duong/workspaces/secondary/ozone2/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-59c4c0e4-c6ce-49e7-a03a-2f973a460919/ozone-meta/db.snapshots
>         at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>         at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>         at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>         at 
> sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
>         at java.nio.file.Files.newDirectoryStream(Files.java:457)
>         at java.nio.file.Files.list(Files.java:3451)
>         at 
> org.apache.hadoop.ozone.om.OMDBCheckpointServlet.processDir(OMDBCheckpointServlet.java:390)
>         at 
> org.apache.hadoop.ozone.om.OMDBCheckpointServlet.getFilesForArchive(OMDBCheckpointServlet.java:322)
>         at 
> org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeDbDataToStream(OMDBCheckpointServlet.java:172)
>         at 
> org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:220)
>         at 
> org.apache.hadoop.hdds.utils.DBCheckpointServlet.doPost(DBCheckpointServlet.java:321)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:523)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
>         at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
>         at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
>  {code}
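
The stack trace above ends in Files.list raising NoSuchFileException on the db.snapshots path. The following is a minimal sketch of that JDK behavior only, not code from Ozone or Ratis: the class MissingDirListing and the helpers listUnguarded and listOrEmpty are hypothetical names, and treating a missing directory as empty is an assumption made for illustration, not a proposed fix for this issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MissingDirListing {

  /** Lists the entries of dir; Files.list throws NoSuchFileException if dir is absent. */
  static List<Path> listUnguarded(Path dir) throws IOException {
    try (Stream<Path> entries = Files.list(dir)) {
      return entries.collect(Collectors.toList());
    }
  }

  /** Hypothetical guard (an assumption): report a missing directory as empty. */
  static List<Path> listOrEmpty(Path dir) throws IOException {
    if (!Files.isDirectory(dir)) {
      return Collections.emptyList();
    }
    try (Stream<Path> entries = Files.list(dir)) {
      return entries.collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for a metadata directory in which db.snapshots was never created.
    Path base = Files.createTempDirectory("om-meta");
    Path missing = base.resolve("db.snapshots");
    try {
      listUnguarded(missing);
      System.out.println("unguarded: listed without error");
    } catch (NoSuchFileException e) {
      System.out.println("unguarded: NoSuchFileException");
    }
    System.out.println("guarded: " + listOrEmpty(missing).size() + " entries");
    Files.delete(base);
  }
}
```

Whether the servlet should skip a missing directory or the directory should be created earlier is a design question this sketch does not answer.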


