[ https://issues.apache.org/jira/browse/HDDS-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637467#comment-17637467 ]
Ethan Rose commented on HDDS-4226:
----------------------------------
We recently saw a similar issue in a cluster where the snapshot directory
filled up the disk. Unfortunately, the cluster was wiped before any evidence
could be retrieved. The snapshots that the follower downloads from the leader
are never deleted if the move fails for any reason.
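
As an illustration only (this is not the actual OzoneManager code), one way to
avoid the leak is to delete the downloaded checkpoint directory whenever the
install step fails. The class and method names below are hypothetical, and the
sketch uses only java.nio.file; the real install path does more than a single
move.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Ozone: shows cleanup after a failed install.
class CheckpointCleanupSketch {

  /** Recursively deletes a directory tree; does nothing if it is absent. */
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order so that children are deleted before their parents.
      paths = walk.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  /**
   * Assumed install step: move the downloaded checkpoint into place.
   * If the move throws (including a RuntimeException such as the NPE in
   * the trace below), the downloaded copy is removed instead of leaked.
   */
  static boolean installOrCleanup(Path downloaded, Path target)
      throws IOException {
    try {
      Files.move(downloaded, target);
      return true;
    } catch (IOException | RuntimeException e) {
      deleteRecursively(downloaded);
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("om-snapshot-sketch");
    Path downloaded = Files.createDirectory(root.resolve("om.db-downloaded"));
    Files.createFile(downloaded.resolve("CURRENT"));
    // The target already exists, so Files.move fails and cleanup runs.
    Path target = Files.createDirectory(root.resolve("om.db"));

    boolean installed = installOrCleanup(downloaded, target);
    System.out.println("installed=" + installed
        + ", downloadedStillExists=" + Files.exists(downloaded));
    deleteRecursively(root);
  }
}
```

A startup sweep of leftover om.db-* directories under the snapshot directory
would additionally cover the case where the process dies before the cleanup
runs.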
> Cleanup OM snapshots left after a failed installSnapshot
> --------------------------------------------------------
>
> Key: HDDS-4226
> URL: https://issues.apache.org/jira/browse/HDDS-4226
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Manager
> Reporter: Mukul Kumar Singh
> Priority: Major
> Labels: MiniOzoneChaosCluster
>
> OzoneManager tries to install the snapshot:
> {code:java}
> 2020-09-09 22:07:14,830 [pool-144-thread-1] INFO om.OzoneManager (OzoneManager.java:installCheckpoint(3159)) - Installing checkpoint with OMTransactionInfo 2#68754
> 2020-09-09 22:07:14,831 [grpc-default-executor-50] INFO impl.RaftServerImpl (RaftServerImpl.java:installSnapshot(1127)) - omNode-2@group-D62218D261DE: reply installSnapshot: omNode-1<-omNode-2#0:FAIL-t2,IN_PROGRESS
> {code}
> It failed because of the issues from HDDS-4224.
> {code:java}
> 2020-09-09 22:07:14,831 [pool-144-thread-1] ERROR om.OzoneManager (OzoneManager.java:installSnapshotFromLeader(3141)) - Failed to install snapshot from Leader OM: {}
> java.lang.NullPointerException
>     at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3168)
>     at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3162)
>     at org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3139)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$notifyInstallSnapshotFromLeader$4(OzoneManagerStateMachine.java:372)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>
> The checkpoints are left behind in the snapshot directory:
> {code:java}
> ➜ chaos-2020-09-09-22-05-33-IST ls
> MiniOzoneClusterImpl-71baac34-2321-4756-ba1e-5834c5628047/omNode-2/ratis/snapshot/om.db-omNode-1-1599669
> om.db-omNode-1-1599669432684/ om.db-omNode-1-1599669451421/
> om.db-omNode-1-1599669478149/ om.db-omNode-1-1599669504818/
> om.db-omNode-1-1599669533577/ om.db-omNode-1-1599669566509/
> om.db-omNode-1-1599669433775/ om.db-omNode-1-1599669453030/
> om.db-omNode-1-1599669480273/ om.db-omNode-1-1599669507385/
> om.db-omNode-1-1599669535603/ om.db-omNode-1-1599669568325/
> om.db-omNode-1-1599669434867/ om.db-omNode-1-1599669454688/
> om.db-omNode-1-1599669482206/ om.db-omNode-1-1599669509373/
> om.db-omNode-1-1599669537716/ om.db-omNode-1-1599669570186/
> om.db-omNode-1-1599669435886/ om.db-omNode-1-1599669456346/
> om.db-omNode-1-1599669484256/ om.db-omNode-1-1599669511241/
> om.db-omNode-1-1599669540574/ om.db-omNode-1-1599669572150/
> om.db-omNode-1-1599669437199/ om.db-omNode-1-1599669458194/
> om.db-omNode-1-1599669486200/ om.db-omNode-1-1599669513051/
> om.db-omNode-1-1599669543136/ om.db-omNode-1-1599669574811/
> om.db-omNode-1-1599669438519/ om.db-omNode-1-1599669459992/
> om.db-omNode-1-1599669487968/ om.db-omNode-1-1599669515343/
> om.db-omNode-1-1599669546272/ om.db-omNode-1-1599669576833/
> om.db-omNode-1-1599669439819/ om.db-omNode-1-1599669461897/
> om.db-omNode-1-1599669490218/ om.db-omNode-1-1599669517332/
> om.db-omNode-1-1599669548363/ om.db-omNode-1-1599669578680/
> om.db-omNode-1-1599669441209/ om.db-omNode-1-1599669463871/
> om.db-omNode-1-1599669492005/ om.db-omNode-1-1599669519320/
> om.db-omNode-1-1599669551596/ om.db-omNode-1-1599669580427/
> om.db-omNode-1-1599669442606/ om.db-omNode-1-1599669465810/
> om.db-omNode-1-1599669493727/ om.db-omNode-1-1599669521491/
> om.db-omNode-1-1599669554153/ om.db-omNode-1-1599669582124/
> om.db-omNode-1-1599669443967/ om.db-omNode-1-1599669467909/
> om.db-omNode-1-1599669495587/ om.db-omNode-1-1599669523436/
> om.db-omNode-1-1599669556370/ om.db-omNode-1-1599669583768/
> om.db-omNode-1-1599669445468/ om.db-omNode-1-1599669470054/
> om.db-omNode-1-1599669497445/ om.db-omNode-1-1599669525567/
> om.db-omNode-1-1599669558461/ om.db-omNode-1-1599669585501/
> om.db-omNode-1-1599669446937/ om.db-omNode-1-1599669472125/
> om.db-omNode-1-1599669499362/ om.db-omNode-1-1599669527648/
> om.db-omNode-1-1599669560578/
> om.db-omNode-1-1599669448360/ om.db-omNode-1-1599669474051/
> om.db-omNode-1-1599669501269/ om.db-omNode-1-1599669529648/
> om.db-omNode-1-1599669562666/
> om.db-omNode-1-1599669449867/ om.db-omNode-1-1599669476078/
> om.db-omNode-1-1599669503036/ om.db-omNode-1-1599669531573/
> om.db-omNode-1-1599669564620/ {code}