[jira] [Work logged] (HDDS-1649) On installSnapshot notification from OM leader, download checkpoint and reload OM state

ASF GitHub Bot (JIRA) Thu, 18 Jul 2019 15:05:30 -0700


     [ https://issues.apache.org/jira/browse/HDDS-1649?focusedWorklogId=279373&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-279373 ]


ASF GitHub Bot logged work on HDDS-1649:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Jul/19 22:04
            Start Date: 18/Jul/19 22:04
    Worklog Time Spent: 10m 
      Work Description: hanishakoneru commented on pull request #948: 
HDDS-1649. On installSnapshot notification from OM leader, download checkpoint 
and reload OM state
URL: https://github.com/apache/hadoop/pull/948#discussion_r305113822
 
 

 ##########
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java
 ##########
 @@ -0,0 +1,193 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package org.apache.hadoop.ozone.om;
+
+import org.apache.commons.lang3.RandomStringUtils;
+import org.apache.hadoop.hdds.conf.OzoneConfiguration;
+import org.apache.hadoop.ozone.MiniOzoneCluster;
+import org.apache.hadoop.ozone.MiniOzoneHAClusterImpl;
+import org.apache.hadoop.ozone.client.ObjectStore;
+import org.apache.hadoop.ozone.client.OzoneBucket;
+import org.apache.hadoop.ozone.client.OzoneClientFactory;
+import org.apache.hadoop.ozone.client.OzoneVolume;
+import org.apache.hadoop.ozone.client.VolumeArgs;
+import org.apache.hadoop.ozone.om.helpers.OmVolumeArgs;
+import org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer;
+import org.apache.hadoop.utils.db.DBCheckpoint;
+import org.apache.hadoop.utils.db.Table;
+import org.apache.hadoop.utils.db.TableIterator;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.rules.Timeout;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+import static org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey;
+
+/**
+ * Tests the Ratis snapshots feature in OM.
+ */
+public class TestOMRatisSnapshots {
+
+  private MiniOzoneHAClusterImpl cluster = null;
+  private ObjectStore objectStore;
+  private OzoneConfiguration conf;
+  private String clusterId;
+  private String scmId;
+  private int numOfOMs = 3;
+  private static final long SNAPSHOT_THRESHOLD = 50;
+  private static final int LOG_PURGE_GAP = 50;
+
+  @Rule
+  public ExpectedException exception = ExpectedException.none();
+
+  @Rule
+  public Timeout timeout = new Timeout(3000_000);
+
+  /**
+   * Create a MiniOzoneHACluster for testing. The cluster initially has one
+   * inactive OM. So at the start of the cluster, there will be 2 active and 1
+   * inactive OM.
+   *
+   * @throws IOException
+   */
+  @Before
+  public void init() throws Exception {
+    conf = new OzoneConfiguration();
+    clusterId = UUID.randomUUID().toString();
+    scmId = UUID.randomUUID().toString();
+    conf.setLong(
+        OMConfigKeys.OZONE_OM_RATIS_SNAPSHOT_AUTO_TRIGGER_THRESHOLD_KEY,
+        SNAPSHOT_THRESHOLD);
+    conf.setInt(OMConfigKeys.OZONE_OM_RATIS_LOG_PURGE_GAP, LOG_PURGE_GAP);
+    cluster = (MiniOzoneHAClusterImpl) MiniOzoneCluster.newHABuilder(conf)
+        .setClusterId(clusterId)
+        .setScmId(scmId)
+        .setOMServiceId("om-service-test1")
+        .setNumOfOzoneManagers(numOfOMs)
+        .setNumOfActiveOMs(2)
+        .build();
+    cluster.waitForClusterToBeReady();
+    objectStore = OzoneClientFactory.getRpcClient(conf).getObjectStore();
+  }
+
+  /**
+   * Shut down the MiniOzoneHACluster.
+   */
+  @After
+  public void shutdown() {
+    if (cluster != null) {
+      cluster.shutdown();
+    }
+  }
+
+  @Test
+  public void testInstallSnapshot() throws Exception {
+    // Get the leader OM
+    String leaderOMNodeId = objectStore.getClientProxy().getOMProxyProvider()
+        .getCurrentProxyOMNodeId();
+    OzoneManager leaderOM = cluster.getOzoneManager(leaderOMNodeId);
+    OzoneManagerRatisServer leaderRatisServer = leaderOM.getOmRatisServer();
+
+    // Find the inactive OM
+    String followerNodeId = leaderOM.getPeerNodes().get(0).getOMNodeId();
+    if (cluster.isOMActive(followerNodeId)) {
+      followerNodeId = leaderOM.getPeerNodes().get(1).getOMNodeId();
+    }
+    OzoneManager followerOM = cluster.getOzoneManager(followerNodeId);
+
+    // Do some transactions so that the log index increases
+    String userName = "user" + RandomStringUtils.randomNumeric(5);
+    String adminName = "admin" + RandomStringUtils.randomNumeric(5);
+    String volumeName = "volume" + RandomStringUtils.randomNumeric(5);
+    String bucketName = "bucket" + RandomStringUtils.randomNumeric(5);
+
+    VolumeArgs createVolumeArgs = VolumeArgs.newBuilder()
+        .setOwner(userName)
+        .setAdmin(adminName)
+        .build();
+
+    objectStore.createVolume(volumeName, createVolumeArgs);
+    OzoneVolume retVolumeinfo = objectStore.getVolume(volumeName);
+
+    retVolumeinfo.createBucket(bucketName);
+    OzoneBucket ozoneBucket = retVolumeinfo.getBucket(bucketName);
+
+    long leaderOMappliedLogIndex =
+        leaderRatisServer.getStateMachineLastAppliedIndex();
+    leaderOM.getOmRatisServer().getStateMachineLastAppliedIndex();
+
+    List<String> keys = new ArrayList<>();
+    while (leaderOMappliedLogIndex < 2000) {
+      keys.add(createKey(ozoneBucket));
+      leaderOMappliedLogIndex =
+          leaderRatisServer.getStateMachineLastAppliedIndex();
+    }
+
+    // Get the latest db checkpoint from the leader OM.
+    long leaderOMSnaphsotIndex = leaderOM.saveRatisSnapshot();
+    DBCheckpoint leaderDbCheckpoint =
+        leaderOM.getMetadataManager().getStore().getCheckpoint(false);
+
+    // Start the inactive OM
+    cluster.startInactiveOM(followerNodeId);
+
+    // The recently started OM should be lagging behind the leader OM.
+    long followerOMLastAppliedIndex =
+        followerOM.getOmRatisServer().getStateMachineLastAppliedIndex();
+    Assert.assertTrue(
+        followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
+
+    // Install leader OM's db checkpoint on the lagging OM.
+    followerOM.getOmRatisServer().getOmStateMachine().pause();
+    followerOM.getMetadataManager().getStore().close();
 
 Review comment:
   Yes, this was for the test only. The problem with testing end to end is that 
HttpGet does not work in unit tests, and the CheckpointServlet (from Recon) that 
we use to download the checkpoint from the leader relies on HttpGet for the 
checkpoint transfer.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 279373)
    Time Spent: 9h 50m  (was: 9h 40m)

> On installSnapshot notification from OM leader, download checkpoint and 
> reload OM state
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-1649
>                 URL: https://issues.apache.org/jira/browse/HDDS-1649
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>            Reporter: Hanisha Koneru
>            Assignee: Hanisha Koneru
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> Installing a DB checkpoint on the OM involves the following steps:
>  1. When an OM follower receives an installSnapshot notification from the OM 
> leader, it should initiate a new checkpoint on the OM leader and download 
> that checkpoint over HTTP. 
>  2. After downloading the checkpoint, the StateMachine must be paused so that 
> the old OM DB can be replaced with the newly downloaded checkpoint. 
>  3. The OM should be reloaded with the new state. All services that depend on 
> the OM DB (such as MetadataManager, KeyManager, etc.) must be 
> re-initialized or restarted. 
>  4. Once the OM is ready with the new state, the state machine must be 
> unpaused so that it resumes participating in the Ratis ring.
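
The ordering of the four steps above can be sketched as a small, self-contained Java program. This is only an illustration of the sequence described in the issue, not the code from the pull request: the class names (InstallCheckpointSketch, StateMachine, Follower), the installCheckpoint method, and the use of a plain long to stand in for the DB checkpoint are all hypothetical, and the HTTP download is simulated by passing the leader's index directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the install sequence; none of these types are Ozone APIs. */
public class InstallCheckpointSketch {

  /** Stand-in for the Ratis state machine; it only tracks whether it is paused. */
  static class StateMachine {
    private boolean paused;

    void pause() { paused = true; }

    void unpause() { paused = false; }

    boolean isPaused() { return paused; }
  }

  /** Stand-in for an OM follower whose DB is represented by its last applied index. */
  static class Follower {
    final StateMachine stateMachine = new StateMachine();
    final List<String> steps = new ArrayList<>();
    long dbIndex;

    Follower(long dbIndex) {
      this.dbIndex = dbIndex;
    }

    /** Runs steps 1 to 4 in order and returns the index of the installed state. */
    long installCheckpoint(long leaderCheckpointIndex) {
      // Step 1: obtain the leader's checkpoint (assumption: simulated, no HTTP here).
      long downloaded = leaderCheckpointIndex;
      steps.add("download");

      // Step 2: pause the state machine, then swap the old DB for the checkpoint.
      stateMachine.pause();
      steps.add("pause");
      dbIndex = downloaded;
      steps.add("replace-db");

      // Step 3: reload the services that depend on the DB (recorded only).
      steps.add("reload-services");

      // Step 4: unpause so the follower rejoins the Ratis ring.
      stateMachine.unpause();
      steps.add("unpause");
      return dbIndex;
    }
  }

  public static void main(String[] args) {
    Follower follower = new Follower(10);
    long index = follower.installCheckpoint(2000);
    System.out.println(follower.steps + " -> index " + index);
  }
}
```

The sketch keeps the DB replacement strictly between pause() and unpause(), which is the property the steps above ask for. Error handling (for example, what to do if the reload fails while the state machine is paused) is left out because the issue description does not specify it.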



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
