[ https://issues.apache.org/jira/browse/HDDS-1649?focusedWorklogId=278526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-278526 ]

ASF GitHub Bot logged work on HDDS-1649:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Jul/19 22:12
            Start Date: 17/Jul/19 22:12
    Worklog Time Spent: 10m 
      Work Description: hanishakoneru commented on pull request #948: 
HDDS-1649. On installSnapshot notification from OM leader, download checkpoint 
and reload OM state
URL: https://github.com/apache/hadoop/pull/948#discussion_r304663826
 
 

 ##########
 File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
 ##########
 @@ -3122,6 +3136,131 @@ public boolean setAcl(OzoneObj obj, List<OzoneAcl> acls) throws IOException {
     }
   }
 
+  /**
+   * Download and install latest checkpoint from leader OM.
+   * If the downloaded checkpoint's snapshot index is greater than this OM's
+   * last applied transaction index, then re-initialize the OM state via this
+   * checkpoint. Before re-initializing OM state, the OM Ratis server should
+   * be stopped so that no new transactions can be applied.
+   * @param leaderId peerNodeID of the leader OM
+   * @return If checkpoint is installed, return the corresponding termIndex.
+   * Otherwise, return null.
+   */
+  public TermIndex installSnapshot(String leaderId) {
+    if (omSnapshotProvider == null) {
+      LOG.error("OM Snapshot Provider is not configured as there are no peer " +
+          "nodes.");
+      return null;
+    }
+
+    DBCheckpoint omDBcheckpoint;
+    try {
+      omDBcheckpoint = omSnapshotProvider.getOzoneManagerDBSnapshot(leaderId);
+    } catch (IOException e) {
+      LOG.error("Failed to download checkpoint from OM leader {}", leaderId, e);
+      return null;
+    }
+
+    // Check if current ratis log index is smaller than the downloaded
+    // snapshot index. If yes, proceed by stopping the ratis server so that
+    // the OM state can be re-initialized. If no, then do not proceed with
+    // installSnapshot.
+    long lastAppliedIndex = omRatisServer.getStateMachineLastAppliedIndex();
+    long checkpointSnapshotIndex = omDBcheckpoint.getRatisSnapshotIndex();
+    if (checkpointSnapshotIndex <= lastAppliedIndex) {
+      LOG.error("Failed to install checkpoint from OM leader: {}. The last " +
+          "applied index: {} is greater than or equal to the checkpoint's " +
+          "snapshot index: {}", leaderId, lastAppliedIndex,
+          checkpointSnapshotIndex);
+      return null;
+    }
+
+    // Stop the ratis server so that no new transactions are applied. This
+    // can happen if a leader election happens while the state is being
+    // re-initialized.
+    omRatisServer.stop();
+
+    // Clear the OM Double Buffer so that if there are any pending
+    // transactions in the buffer, they are discarded.
+    omDoubleBuffer.stop();
+
+    // Take a backup of the current DB
+    File dbFile = metadataManager.getStore().getDbLocation();
+    String dbBackupFileName = OzoneConsts.OM_DB_BACKUP_PREFIX +
+        lastAppliedIndex + "_" + System.currentTimeMillis();
+    File dbBackupFile = new File(dbFile.getParentFile(), dbBackupFileName);
+
+    try {
+      Files.move(dbFile.toPath(), dbBackupFile.toPath());
+    } catch (IOException e) {
+      LOG.error("Failed to create a backup of the current DB. Aborting " +
+          "snapshot installation.", e);
+      return null;
+    }
+
+    // Move the downloaded DB checkpoint into the om metadata dir
+    Path checkpointPath = omDBcheckpoint.getCheckpointLocation();
+    try {
+      Files.move(checkpointPath, dbFile.toPath());
+    } catch (IOException e) {
+      LOG.error("Failed to move downloaded DB checkpoint {} to metadata " +
+          "directory {}", checkpointPath, dbFile.toPath(), e);
+      return null;
+    }
+
+    // Reload the OM DB store with the new checkpoint
+    try {
+      reloadOMState();
+    } catch (IOException e) {
+      LOG.error("Failed to reload OM state with new DB checkpoint.", e);
+      return null;
+    }
+
+    // TODO: We should only return the snapshotIndex to the leader.
 
 Review comment:
   The InstallSnapshot notification response requires a TermIndex, but we only
have the installed snapshot's index. The term is irrelevant here, so we return
a TermIndex with a dummy term (0).
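
   A minimal sketch of the behaviour described in this comment, not the PR's
actual code. SimpleTermIndex, DUMMY_TERM and resultFor are hypothetical names
standing in for Ratis' TermIndex and the tail of installSnapshot. It shows the
index guard from the diff above and a return value that pairs the checkpoint's
snapshot index with a placeholder term of 0.

```java
public class InstallSnapshotSketch {

    /** Hypothetical stand-in for Ratis' TermIndex: a (term, index) pair. */
    static final class SimpleTermIndex {
        final long term;
        final long index;

        SimpleTermIndex(long term, long index) {
            this.term = term;
            this.index = index;
        }

        @Override
        public String toString() {
            return "(t:" + term + ", i:" + index + ")";
        }
    }

    /** Placeholder term; assumed here because only the index is known. */
    static final long DUMMY_TERM = 0L;

    /**
     * Returns the value to report after an install attempt, or null when the
     * checkpoint is not newer than the locally applied state (same guard as
     * in the diff: checkpointSnapshotIndex <= lastAppliedIndex).
     */
    static SimpleTermIndex resultFor(long lastAppliedIndex,
                                     long checkpointSnapshotIndex) {
        if (checkpointSnapshotIndex <= lastAppliedIndex) {
            return null;
        }
        return new SimpleTermIndex(DUMMY_TERM, checkpointSnapshotIndex);
    }

    public static void main(String[] args) {
        System.out.println(resultFor(100L, 250L)); // prints "(t:0, i:250)"
        System.out.println(resultFor(300L, 250L)); // prints "null"
    }
}
```

   Keeping the placeholder in a named constant makes it easy to find and remove
if the response is later changed to carry only the snapshot index, as the TODO
in the diff suggests.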
 


Issue Time Tracking
-------------------

    Worklog Id:     (was: 278526)
    Time Spent: 3h  (was: 2h 50m)

> On installSnapshot notification from OM leader, download checkpoint and 
> reload OM state
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-1649
>                 URL: https://issues.apache.org/jira/browse/HDDS-1649
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>            Reporter: Hanisha Koneru
>            Assignee: Hanisha Koneru
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> When an OM follower receives installSnapshot notification from OM leader, it 
> should initiate a new checkpoint on the OM leader and download that 
> checkpoint.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

