[jira] [Work logged] (HDDS-1454) GC other system pause events can trigger pipeline destroy for all the nodes in the cluster

ASF GitHub Bot (JIRA) Tue, 18 Jun 2019 02:36:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-1454?focusedWorklogId=262171&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-262171
 ]


ASF GitHub Bot logged work on HDDS-1454:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Jun/19 09:35
            Start Date: 18/Jun/19 09:35
    Worklog Time Spent: 10m 
      Work Description: nandakumar131 commented on pull request #852: 
HDDS-1454. GC other system pause events can trigger pipeline destroy for all 
the nodes in the cluster. Contributed by Supratim Deka
URL: https://github.com/apache/hadoop/pull/852#discussion_r294695045
 
 

 ##########
 File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeStateManager.java
 ##########
 @@ -464,6 +487,44 @@ public void setContainers(UUID uuid, Set<ContainerID> containerIds)
   @Override
   public void run() {
 
+    if (shouldSkipCheck()) {
+      skippedHealthChecks++;
+      LOG.info("Detected long delay in scheduling HB processing thread. "
+          + "Skipping heartbeat checks for one iteration.");
+    } else {
+      checkNodesHealth();
+    }
+
+    // We purposefully make this non-deterministic. Instead of using
+    // scheduleAtFixedFrequency, we just go to sleep and wake up at the
+    // next rendezvous point, which is currentTime +
+    // heartbeatCheckerIntervalMs. As a result, we are not heartbeating at
+    // a fixed cadence, but at clock tick + time taken to work.
+    //
+    // This time taken to work can skew the heartbeat processor thread.
+    // We accept this skew for the following reasons.
+    //
+    // 1. checkerInterval is generally many magnitudes faster than the
+    // datanode HB frequency.
+    //
+    // 2. If we have too many nodes, the SCM would be doing only HB
+    // processing, which could lead to CPU starvation in the SCM. With this
+    // approach we always guarantee that the HB thread sleeps for a little
+    // while.
+    //
+    // 3. It is possible that we will never finish processing the HBs in the
+    // thread. But that means we have a misconfigured system. We will warn
+    // the users by logging that information.
+    //
+    // 4. And the most important reason: heartbeats are not blocked even if
+    // this thread does not run; they will go into the processing queue.
+    scheduleNextHealthCheck();
+
+    return;
 
 Review comment:
   We don't need this return statement.
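
 To illustrate the pattern under review, here is a minimal, self-contained
 sketch of a pause-aware health checker. It is not the code from the pull
 request: the class name PauseAwareChecker, the injected clock, the counters,
 and the "more than twice the interval" threshold are all assumptions made for
 illustration. It shows run() ending without a trailing return, since a void
 method returns implicitly after its last statement.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a health checker that skips one iteration when it
 * detects that it was scheduled much later than expected (for example after
 * a long GC pause). Names and threshold are illustrative assumptions.
 */
final class PauseAwareChecker implements Runnable {
  private final long intervalMs;
  private final LongSupplier clock;   // injected so the logic is testable
  private long lastRunMs;             // time the previous iteration finished
  private long skippedHealthChecks;
  private long completedHealthChecks;

  PauseAwareChecker(long intervalMs, LongSupplier clock) {
    this.intervalMs = intervalMs;
    this.clock = clock;
    this.lastRunMs = clock.getAsLong();
  }

  /** Assumed rule: skip when the gap exceeds twice the configured interval. */
  boolean shouldSkipCheck() {
    long elapsedMs = clock.getAsLong() - lastRunMs;
    return elapsedMs > 2 * intervalMs;
  }

  @Override
  public void run() {
    if (shouldSkipCheck()) {
      // A long delay means node timestamps may look stale only because this
      // process was paused, so do not judge node health this round.
      skippedHealthChecks++;
    } else {
      // Stands in for the real health check over all nodes.
      completedHealthChecks++;
    }
    // Stands in for scheduling the next check: remember when this one ended.
    lastRunMs = clock.getAsLong();
    // No explicit return is needed at the end of a void method.
  }

  long getSkippedHealthChecks() {
    return skippedHealthChecks;
  }

  long getCompletedHealthChecks() {
    return completedHealthChecks;
  }
}
```

 Injecting the clock keeps the sketch deterministic: a test can advance the
 time by hand and observe that only the iteration following a long gap is
 skipped.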
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 262171)
    Time Spent: 40m  (was: 0.5h)

> GC other system pause events can trigger pipeline destroy for all the nodes 
> in the cluster
> ------------------------------------------------------------------------------------------
>
>                 Key: HDDS-1454
>                 URL: https://issues.apache.org/jira/browse/HDDS-1454
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Mukul Kumar Singh
>            Assignee: Supratim Deka
>            Priority: Major
>              Labels: MiniOzoneChaosCluster, pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In a MiniOzoneChaosCluster run, it was observed that events such as GC pauses,
> or any other pauses in SCM, can cause SCM to mark all the datanodes as stale.
> This triggers multiple pipeline destroys and renders the system unusable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

