[
https://issues.apache.org/jira/browse/HDDS-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883060#comment-17883060
]
Tsz-wo Sze commented on HDDS-11470:
-----------------------------------
Below is an example in which om2 replied "Completed INSTALL_SNAPSHOT" even
though it had actually failed to move the downloaded DB checkpoint due to a
FileSystemException (translated from a UnixException) "Invalid cross-device
link".
{code}
2024-09-18 09:28:20,365 INFO [grpc-default-executor-1]-org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastReply: null
2024-09-18 09:28:20,365 INFO [pool-33-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: metadataManager is stopped. Spend 7 ms.
2024-09-18 09:28:20,367 ERROR [pool-33-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Failed to move downloaded DB checkpoint /var/lib/hadoop-ozone/om/ozone-metaot/om.db.candidate to metadata directory /ozone/hadoop-ozone/om/data/om.db. Exception: {}. Resetting to original DB.
java.nio.file.FileSystemException: /ozone/hadoop-ozone/om/data/om.db/000044.sst -> /var/lib/hadoop-ozone/om/ozone-metadata/snapshot/om.db.candidate/000044.sst: Invalid cross-device link
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:477)
    at java.base/java.nio.file.Files.createLink(Files.java:1101)
    at org.apache.hadoop.ozone.om.snapshot.OmSnapshotUtils.linkFiles(OmSnapshotUtils.java:169)
    at org.apache.hadoop.ozone.om.OzoneManager.moveCheckpointFiles(OzoneManager.java:3884)
    at org.apache.hadoop.ozone.om.OzoneManager.replaceOMDBWithCheckpoint(OzoneManager.java:3864)
    at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3738)
    at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3673)
    at org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3650)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$5(OzoneManagerStateMachine.java:505)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
{code}
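"Invalid cross-device link" is the message for the EXDEV error, which the OS returns when a hard link is requested across two file systems. The trace links a file under /ozone/... to a path under /var/lib/..., which suggests the two directories are on different mounts. The following is only a minimal sketch of how such a case could be detected before linking, not the Ozone implementation: it compares the FileStore of the two parent directories and falls back to a copy when they differ. The class and method names (LinkOrCopy, sameFileStore, linkOrCopy) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

class LinkOrCopy {
  /** Returns true if both existing paths are on the same file store. */
  static boolean sameFileStore(Path a, Path b) throws IOException {
    FileStore storeA = Files.getFileStore(a);
    FileStore storeB = Files.getFileStore(b);
    return storeA.equals(storeB);
  }

  /**
   * Creates newFile as a hard link to existing when both parent directories
   * are on the same file store; otherwise copies the file.
   * Returns "link" or "copy" to show which path was taken.
   */
  static String linkOrCopy(Path existing, Path newFile) throws IOException {
    if (sameFileStore(existing.getParent(), newFile.getParent())) {
      Files.createLink(newFile, existing);
      return "link";
    }
    // Hard links cannot span file systems, so fall back to a copy.
    Files.copy(existing, newFile);
    return "copy";
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("link-or-copy");
    Path src = Files.write(dir.resolve("000044.sst"), new byte[] {1, 2, 3});
    Path dst = dir.resolve("000044.sst.linked");
    System.out.println(linkOrCopy(src, dst) + ", size " + Files.size(dst));
  }
}
```

The FileStore comparison is a heuristic. A caller that needs certainty would still have to handle a FileSystemException from Files.createLink.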
> OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed
> ----------------------------------------------------------------------------
>
> Key: HDDS-11470
> URL: https://issues.apache.org/jira/browse/HDDS-11470
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM HA
> Reporter: Tsz-wo Sze
> Priority: Major
>
> When OM fails to installCheckpoint (e.g. HDDS-10300), it should not reply
> "Completed INSTALL_SNAPSHOT".
> In the code below, when there is an exception, it just prints an error message
> and continues to reply "Completed INSTALL_SNAPSHOT".
> {code}
> // OzoneManager.installCheckpoint
> try {
>   time = Time.monotonicNow();
>   dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex,
>       oldDBLocation, checkpointLocation);
>   term = checkpointTrxnInfo.getTerm();
>   lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex();
>   LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " +
>       "index: {}, time: {} ms", leaderId, term, lastAppliedIndex,
>       Time.monotonicNow() - time);
> } catch (Exception e) {
>   LOG.error("Failed to install Snapshot from {} as OM failed to replace" +
>       " DB with downloaded checkpoint. Reloading old OM state.",
>       leaderId, e);
> }
> {code}
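To illustrate the difference between swallowing the exception and surfacing it, here is a small self-contained sketch. It is not the Ozone or Ratis code: replaceDb, installSwallowing, installPropagating, reply, and the "Failed INSTALL_SNAPSHOT" text are all hypothetical stand-ins. The one thing taken from the stack trace is that the install runs inside a CompletableFuture, so one possible approach is to complete that future exceptionally and let the caller choose the reply.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class InstallReplySketch {
  /** Hypothetical stand-in for replacing the DB; fails on request. */
  static long replaceDb(boolean fail, long checkpointIndex) throws IOException {
    if (fail) {
      throw new IOException("Invalid cross-device link");
    }
    return checkpointIndex;
  }

  /** Mirrors the quoted pattern: the failure is logged and then ignored. */
  static long installSwallowing(boolean fail, long oldIndex, long checkpointIndex) {
    long lastApplied = oldIndex;
    try {
      lastApplied = replaceDb(fail, checkpointIndex);
    } catch (Exception e) {
      System.err.println("Failed to install checkpoint: " + e.getMessage());
    }
    // The caller only sees an index and cannot tell that the install failed.
    return lastApplied;
  }

  /** Alternative: the future completes exceptionally when the install fails. */
  static CompletableFuture<Long> installPropagating(boolean fail, long checkpointIndex) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return replaceDb(fail, checkpointIndex);
      } catch (IOException e) {
        throw new CompletionException(e);
      }
    });
  }

  /** Chooses the reply text from the outcome of the future. */
  static String reply(CompletableFuture<Long> install) {
    try {
      return "Completed INSTALL_SNAPSHOT, index: " + install.join();
    } catch (CompletionException e) {
      return "Failed INSTALL_SNAPSHOT: " + e.getCause().getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(installSwallowing(true, 10L, 20L));
    System.out.println(reply(installPropagating(false, 20L)));
    System.out.println(reply(installPropagating(true, 20L)));
  }
}
```

With the swallowing variant, a failed install returns the old index without any error signal, which is how a "Completed" reply can follow a failure. With the propagating variant, the caller can distinguish the two outcomes.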