[GitHub] [ozone] bshashikant commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

GitBox Tue, 08 Jun 2021 23:34:58 -0700


bshashikant commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648000416




##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +308,23 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);
+
+    if (currentLeaderTerm.get() == term &&

Review comment:
       Can we add comments here on why the logic is necessary?

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -173,6 +187,7 @@ public void notifyNotLeader(Collection<TransactionContext> 
pendingEntries) {
 
     scm.getScmContext().updateLeaderAndTerm(false, 0);
     scm.getSCMServiceManager().notifyStatusChanged();
+

Review comment:
       Unintended change

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMContext.java
##########
@@ -96,6 +96,38 @@ public void updateLeaderAndTerm(boolean leader, long 
newTerm) {
     }
   }
 
+  /**
+   * Update isLeader flag.
+   * @param leader
+   */
+  public void updateLeader(boolean leader) {
+    lock.writeLock().lock();
+    try {

Review comment:
       This looks a bit confusing. It should be ideally leaderReady. Let's 
maintain notion of leader and leaderReady separately.

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After restart ratis replay logs from last snapshot index.
+      // So if some transactions which need to be updated to DB will not be
+      // applied to DB. After a restart of SCM container/pipeline managers
+      // have setup the safemode rules with not to update DB. Due to this
+      // some time safemode rules are not validated and SCM does not exit
+      // safe mode. So, once after restart as transactions are applied, we
+      // check whether safe mode rules are validated to solve the issue of
+      // SCM not coming out of safemode.
+      if (scm.isInSafeMode()) {
+        scm.getScmSafeModeManager().refreshAndValidate();

Review comment:
       This check should happen irrespective of, whether its leader or not 
right?

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java
##########
@@ -130,30 +124,61 @@ protected void process(Pipeline pipeline) {
       SCMSafeModeManager.getLogger().info(
           "SCM in safe mode. Healthy pipelines reported count is {}, " +
               "required healthy pipeline reported count is {}",
-          currentHealthyPipelineCount, healthyPipelineThresholdCount);
+          currentHealthyPipelineCount, getHealthyPipelineThresholdCount());
+
     }
   }
 
+
+  public synchronized void refresh() {
+    if (!validate()) {
+      initializeRule(true);
+    }
+  }
+
+  private synchronized void initializeRule(boolean refresh) {
+    int pipelineCount = pipelineManager.getPipelines(
+        new RatisReplicationConfig(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN).size();
+
+    healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
+        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));
+
+    if (refresh) {
+      LOG.info("Refreshed total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    } else {
+      LOG.info("Total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    }
+
+    getSafeModeMetrics().setNumHealthyPipelinesThreshold(
+        healthyPipelineThresholdCount);
+  }
+
+
   @Override
-  protected void cleanup() {
+  protected synchronized void cleanup() {
     processedPipelineIDs.clear();
   }
 
   @VisibleForTesting
-  public int getCurrentHealthyPipelineCount() {
+  public synchronized int getCurrentHealthyPipelineCount() {

Review comment:
       Why these calls need to be synchronized()?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [ozone] bshashikant commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Reply via email to