[ 
https://issues.apache.org/jira/browse/HDDS-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637467#comment-17637467
 ] 

Ethan Rose commented on HDDS-4226:
----------------------------------

We recently saw a similar issue in a cluster where the snapshot directory 
filled up the disk. Unfortunately, the cluster was wiped before any evidence 
could be retrieved. The snapshots that the follower downloads from the leader 
are never deleted if the move fails for any reason.
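
One possible direction, sketched below under stated assumptions: wrap the 
install step so that the downloaded checkpoint directory is removed whenever 
the step throws. The class CheckpointInstallSketch, the InstallStep interface 
and the method names are hypothetical and are not taken from OzoneManager; the 
sketch only illustrates the delete-on-failure idea.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not the actual OzoneManager code.
final class CheckpointInstallSketch {

  /** Stand-in for the step that moves the checkpoint into place. */
  interface InstallStep {
    void run(Path checkpointDir) throws Exception;
  }

  /**
   * Runs the install step. If it throws (including unchecked exceptions
   * such as NullPointerException), the downloaded checkpoint directory is
   * deleted so it does not accumulate on disk. Returns true on success.
   */
  static boolean installOrDiscard(Path checkpointDir, InstallStep step) {
    try {
      step.run(checkpointDir);
      return true;
    } catch (Exception e) {
      // A real implementation would log the failure before cleaning up.
      deleteRecursively(checkpointDir);
      return false;
    }
  }

  /** Deletes a directory tree, children before parents. */
  static void deleteRecursively(Path root) {
    if (!Files.exists(root)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(root)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

On success the directory is left untouched, since in this sketch the install 
step itself is assumed to be responsible for moving it into place.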

> Cleanup OM snapshots left after a failed installSnapshot
> --------------------------------------------------------
>
>                 Key: HDDS-4226
>                 URL: https://issues.apache.org/jira/browse/HDDS-4226
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Manager
>            Reporter: Mukul Kumar Singh
>            Priority: Major
>              Labels: MiniOzoneChaosCluster
>
> OzoneManager tries to install the snapshot:
> {code:java}
> 2020-09-09 22:07:14,830 [pool-144-thread-1] INFO  om.OzoneManager (OzoneManager.java:installCheckpoint(3159)) - Installing checkpoint with OMTransactionInfo 2#68754
> 2020-09-09 22:07:14,831 [grpc-default-executor-50] INFO  impl.RaftServerImpl (RaftServerImpl.java:installSnapshot(1127)) - omNode-2@group-D62218D261DE: reply installSnapshot: omNode-1<-omNode-2#0:FAIL-t2,IN_PROGRESS
> {code}
> It failed because of the issues from HDDS-4224.
> {code:java}
> 2020-09-09 22:07:14,831 [pool-144-thread-1] ERROR om.OzoneManager (OzoneManager.java:installSnapshotFromLeader(3141)) - Failed to install snapshot from Leader OM: {}
> java.lang.NullPointerException
>         at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3168)
>         at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3162)
>         at org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3139)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$notifyInstallSnapshotFromLeader$4(OzoneManagerStateMachine.java:372)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> The checkpoint is left in the snapshot directory.
> {code:java}
> ➜  chaos-2020-09-09-22-05-33-IST ls MiniOzoneClusterImpl-71baac34-2321-4756-ba1e-5834c5628047/omNode-2/ratis/snapshot/om.db-omNode-1-1599669
> om.db-omNode-1-1599669432684/  om.db-omNode-1-1599669451421/  
> om.db-omNode-1-1599669478149/  om.db-omNode-1-1599669504818/  
> om.db-omNode-1-1599669533577/  om.db-omNode-1-1599669566509/
> om.db-omNode-1-1599669433775/  om.db-omNode-1-1599669453030/  
> om.db-omNode-1-1599669480273/  om.db-omNode-1-1599669507385/  
> om.db-omNode-1-1599669535603/  om.db-omNode-1-1599669568325/
> om.db-omNode-1-1599669434867/  om.db-omNode-1-1599669454688/  
> om.db-omNode-1-1599669482206/  om.db-omNode-1-1599669509373/  
> om.db-omNode-1-1599669537716/  om.db-omNode-1-1599669570186/
> om.db-omNode-1-1599669435886/  om.db-omNode-1-1599669456346/  
> om.db-omNode-1-1599669484256/  om.db-omNode-1-1599669511241/  
> om.db-omNode-1-1599669540574/  om.db-omNode-1-1599669572150/
> om.db-omNode-1-1599669437199/  om.db-omNode-1-1599669458194/  
> om.db-omNode-1-1599669486200/  om.db-omNode-1-1599669513051/  
> om.db-omNode-1-1599669543136/  om.db-omNode-1-1599669574811/
> om.db-omNode-1-1599669438519/  om.db-omNode-1-1599669459992/  
> om.db-omNode-1-1599669487968/  om.db-omNode-1-1599669515343/  
> om.db-omNode-1-1599669546272/  om.db-omNode-1-1599669576833/
> om.db-omNode-1-1599669439819/  om.db-omNode-1-1599669461897/  
> om.db-omNode-1-1599669490218/  om.db-omNode-1-1599669517332/  
> om.db-omNode-1-1599669548363/  om.db-omNode-1-1599669578680/
> om.db-omNode-1-1599669441209/  om.db-omNode-1-1599669463871/  
> om.db-omNode-1-1599669492005/  om.db-omNode-1-1599669519320/  
> om.db-omNode-1-1599669551596/  om.db-omNode-1-1599669580427/
> om.db-omNode-1-1599669442606/  om.db-omNode-1-1599669465810/  
> om.db-omNode-1-1599669493727/  om.db-omNode-1-1599669521491/  
> om.db-omNode-1-1599669554153/  om.db-omNode-1-1599669582124/
> om.db-omNode-1-1599669443967/  om.db-omNode-1-1599669467909/  
> om.db-omNode-1-1599669495587/  om.db-omNode-1-1599669523436/  
> om.db-omNode-1-1599669556370/  om.db-omNode-1-1599669583768/
> om.db-omNode-1-1599669445468/  om.db-omNode-1-1599669470054/  
> om.db-omNode-1-1599669497445/  om.db-omNode-1-1599669525567/  
> om.db-omNode-1-1599669558461/  om.db-omNode-1-1599669585501/
> om.db-omNode-1-1599669446937/  om.db-omNode-1-1599669472125/  
> om.db-omNode-1-1599669499362/  om.db-omNode-1-1599669527648/  
> om.db-omNode-1-1599669560578/
> om.db-omNode-1-1599669448360/  om.db-omNode-1-1599669474051/  
> om.db-omNode-1-1599669501269/  om.db-omNode-1-1599669529648/  
> om.db-omNode-1-1599669562666/
> om.db-omNode-1-1599669449867/  om.db-omNode-1-1599669476078/  
> om.db-omNode-1-1599669503036/  om.db-omNode-1-1599669531573/  
> om.db-omNode-1-1599669564620/
> {code}
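
For directories that have already accumulated, like the om.db-omNode-1-* 
entries listed above, a sweep of the snapshot directory is one conceivable 
cleanup. The following is a minimal sketch, assuming leftover checkpoints can 
be recognized by the "om.db-" name prefix seen in the listing and that no 
install is in progress while the sweep runs. SnapshotDirSweepSketch and its 
methods are hypothetical names, not existing Ozone APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not an existing Ozone class.
final class SnapshotDirSweepSketch {

  // Assumption: leftover checkpoints are directories named "om.db-<suffix>".
  static final String CHECKPOINT_GLOB = "om.db-*";

  /** Deletes matching directories directly under snapshotDir and returns them. */
  static List<Path> sweep(Path snapshotDir) {
    List<Path> leftovers = new ArrayList<>();
    if (!Files.isDirectory(snapshotDir)) {
      return leftovers;
    }
    // Collect first, then delete, so the listing is not modified mid-iteration.
    try (DirectoryStream<Path> entries =
             Files.newDirectoryStream(snapshotDir, CHECKPOINT_GLOB)) {
      for (Path entry : entries) {
        if (Files.isDirectory(entry)) {
          leftovers.add(entry);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    for (Path dir : leftovers) {
      deleteRecursively(dir);
    }
    return leftovers;
  }

  /** Deletes a directory tree, children before parents. */
  static void deleteRecursively(Path root) {
    try (Stream<Path> walk = Files.walk(root)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Entries that do not match the prefix, and plain files that happen to match it, 
are left alone. Whether such a sweep belongs at OM startup or after each failed 
installSnapshot is a design question this sketch does not answer.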



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
