[ 
https://issues.apache.org/jira/browse/HDDS-10696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843296#comment-17843296
 ] 

Hemant Kumar commented on HDDS-10696:
-------------------------------------

The issue is not caused by snapshot creation for the tarball. The 
NoSuchFileException is thrown when it tries to get the existing Ozone snapshot 
data from the leader and the snapshot dir (aka db.snapshots) doesn't exist 
[here|https://github.com/apache/ozone/blob/2c0580dbb2edeb94ec263aa44fd1e22cff838dcc/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java#L320].
 I think we need to add a check here for whether db.snapshots exists, and copy 
the snapshot data only if it does.
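
A minimal sketch of the kind of guard suggested above, not the actual OMDBCheckpointServlet code: the class SnapshotDirGuard and the method listSnapshotEntries are hypothetical names used only for illustration. It assumes the only requirement is that a missing db.snapshots directory is treated as "no snapshot data to copy" instead of letting Files.list throw NoSuchFileException.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Apache Ozone.
public class SnapshotDirGuard {

  // Returns the entries of snapshotDir, or an empty list when the
  // directory does not exist (e.g. no Ozone snapshot was ever taken).
  static List<Path> listSnapshotEntries(Path snapshotDir) throws IOException {
    if (!Files.isDirectory(snapshotDir)) {
      return Collections.emptyList();
    }
    // Files.list keeps a directory handle open, so close the stream.
    try (Stream<Path> entries = Files.list(snapshotDir)) {
      return entries.sorted().collect(Collectors.toList());
    } catch (NoSuchFileException e) {
      // The directory was removed between the check and the listing.
      return Collections.emptyList();
    }
  }

  public static void main(String[] args) throws IOException {
    Path metaDir = Files.createTempDirectory("ozone-meta");
    Path snapshotDir = metaDir.resolve("db.snapshots");

    // db.snapshots is missing: no exception, nothing to copy.
    System.out.println(listSnapshotEntries(snapshotDir).size());

    // db.snapshots exists with one entry: it is listed as usual.
    Files.createDirectory(snapshotDir);
    Files.createFile(snapshotDir.resolve("snap-1"));
    System.out.println(listSnapshotEntries(snapshotDir).size());
  }
}
```

With a guard like this, a bootstrapping node that requests a checkpoint from a leader without any Ozone snapshots would simply receive a tarball without snapshot data. Whether that is the right behavior for the real servlet is for the actual fix to decide.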

> Ozone integration test fails because of empty snapshot installation.
> --------------------------------------------------------------------
>
>                 Key: HDDS-10696
>                 URL: https://issues.apache.org/jira/browse/HDDS-10696
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>            Reporter: Duong
>            Assignee: Hemant Kumar
>            Priority: Major
>
> I believe RATIS-2045 introduced a regression. Many Ozone integration 
> tests fail once this commit is included, probably because nodes can't be 
> added to a Ratis ring that has no log entries.
> Sample run: 
> https://github.com/duongkame/ozone/actions/runs/8623155740/job/23636444671
> Sample failed test: TestAddRemoveOzoneManager.testBootstrap
>  
> Before the commit, it seems that new nodes didn't have to install snapshots 
> (the installSnapshot check returns ALREADY_INSTALLED).
> {code:java}
> 2024-04-10 17:31:36,897 [grpc-default-executor-2] INFO  
> ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received 
> Configuration change notification from Ratis. New Peer list:
> [id: "omNode-1"
> address: "localhost:15015"
> startupRole: FOLLOWER
> ]
> 2024-04-10 17:31:36,905 [grpc-default-executor-2] INFO  
> ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) 
> - Added OM omNode-1 to Ratis Peers list.
> 2024-04-10 17:31:36,906 [grpc-default-executor-2] INFO  om.OzoneManager 
> (OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer 
> list.
> 2024-04-10 17:31:36,909 [grpc-default-executor-2] INFO  
> impl.SnapshotInstallationHandler 
> (SnapshotInstallationHandler.java:installSnapshot(103)) - 
> omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot: 
> omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
> 2024-04-10 17:31:36,922 [grpc-default-executor-2] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1: 
> Completed INSTALL_SNAPSHOT, lastRequest: 
> omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
> 2024-04-10 17:31:36,923 [grpc-default-executor-2] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:lambda$onCompleted$7(202)) - 
> omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null
> 2024-04-10 17:31:36,924 [grpc-default-executor-0] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(658)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  received the first reply 
> omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
> 2024-04-10 17:31:36,929 [grpc-default-executor-0] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(679)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  Follower snapshot is already at index 0.
> 2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:info(64)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: matchIndex: 
> setUnconditionally -1 -> 0
> 2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO  leader.FollowerInfo 
> (FollowerInfoImpl.java:info(64)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: nextIndex: 
> setUnconditionally 0 -> 1 {code}
> After the commit, a new node seems to have to install an (empty) snapshot.
> {code:java}
> 2024-04-10 17:46:58,830 [grpc-default-executor-8] INFO  
> ratis.OzoneManagerStateMachine 
> (OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received 
> Configuration change notification from Ratis. New Peer list:
> [id: "omNode-1"
> address: "localhost:15015"
> startupRole: FOLLOWER
> ]
> 2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO  
> ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) 
> - Added OM omNode-1 to Ratis Peers list.
> 2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO  om.OzoneManager 
> (OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer 
> list.
> 2024-04-10 17:46:58,847 [grpc-default-executor-8] INFO  
> impl.SnapshotInstallationHandler 
> (SnapshotInstallationHandler.java:installSnapshot(103)) - 
> omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot: 
> omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
> 2024-04-10 17:46:58,862 [omNode-bootstrap-1-InstallSnapshotThread] INFO  
> ratis_snapshot.OmRatisSnapshotProvider 
> (OmRatisSnapshotProvider.java:downloadSnapshot(146)) - Downloading latest 
> checkpoint from Leader OM omNode-1. Checkpoint URL: 
> http://127.0.0.1:15013/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
> 2024-04-10 17:46:58,870 [grpc-default-executor-8] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1: 
> Completed INSTALL_SNAPSHOT, lastRequest: 
> omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
> 2024-04-10 17:46:58,871 [grpc-default-executor-7] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(658)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  received the first reply 
> omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
> 2024-04-10 17:46:58,875 [grpc-default-executor-7] INFO  
> server.GrpcLogAppender (GrpcLogAppender.java:onNext(674)) - 
> omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler:
>  InstallSnapshot in progress.
> 2024-04-10 17:46:58,876 [grpc-default-executor-8] INFO  
> server.GrpcServerProtocolService 
> (GrpcServerProtocolService.java:lambda$onCompleted$7(202)) - 
> omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null {code}
> This snapshot installation seems to always fail, because no checkpoint 
> is created (perhaps because there are no logs?).
> {code:java}
> 2024-04-10 18:13:10,601 [qtp1588976146-447] ERROR utils.DBCheckpointServlet 
> (DBCheckpointServlet.java:generateSnapshotCheckpoint(238)) - Unable to 
> process metadata snapshot request.
> java.nio.file.NoSuchFileException: 
> /Users/duong/workspaces/secondary/ozone2/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-59c4c0e4-c6ce-49e7-a03a-2f973a460919/ozone-meta/db.snapshots
>         at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>         at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>         at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>         at 
> sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
>         at java.nio.file.Files.newDirectoryStream(Files.java:457)
>         at java.nio.file.Files.list(Files.java:3451)
>         at 
> org.apache.hadoop.ozone.om.OMDBCheckpointServlet.processDir(OMDBCheckpointServlet.java:390)
>         at 
> org.apache.hadoop.ozone.om.OMDBCheckpointServlet.getFilesForArchive(OMDBCheckpointServlet.java:322)
>         at 
> org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeDbDataToStream(OMDBCheckpointServlet.java:172)
>         at 
> org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:220)
>         at 
> org.apache.hadoop.hdds.utils.DBCheckpointServlet.doPost(DBCheckpointServlet.java:321)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:523)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
>         at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
>         at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
>  {code}



