Elek, Marton created HDDS-1284:
----------------------------------

             Summary: Adjust default values of pipeline recovery for more 
resilient service restart
                 Key: HDDS-1284
                 URL: https://issues.apache.org/jira/browse/HDDS-1284
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
            Reporter: Elek, Marton
            Assignee: Elek, Marton


As of now we have the following algorithm to handle node failures:

1. In case of a missing node, the leader of the pipeline or the SCM can detect 
the missing heartbeats.
2. SCM starts to close the pipeline (CLOSING state) and tries to close the 
containers with the remaining nodes in the pipeline.
3. After 5 minutes the pipeline is destroyed (CLOSED) and a new pipeline can be 
created from the healthy nodes (one node can be part of only one pipeline at a 
time).
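
The flow above can be summarized with a minimal state-machine sketch. This is 
illustrative only: the state names follow the description above and the 
PipelineTracker class is invented for this example, it does not mirror the real 
SCM implementation.

{code:java}
// Illustrative sketch of the pipeline lifecycle described above; the class
// and state names are simplified and hypothetical.
import java.time.Duration;
import java.time.Instant;

public class PipelineTracker {

  enum State { OPEN, CLOSING, CLOSED }

  private State state = State.OPEN;
  private Instant closingSince;
  // Corresponds to ozone.scm.pipeline.destroy.timeout (5 minutes today).
  private final Duration destroyTimeout = Duration.ofMinutes(5);

  /** Called when missing heartbeats are detected for a pipeline member. */
  void onStaleNode() {
    if (state == State.OPEN) {
      state = State.CLOSING;          // start closing containers on the
      closingSince = Instant.now();   // remaining nodes of the pipeline
    }
  }

  /** Called periodically; destroys the pipeline after the timeout. */
  void tick() {
    if (state == State.CLOSING
        && Duration.between(closingSince, Instant.now())
            .compareTo(destroyTimeout) >= 0) {
      state = State.CLOSED;           // nodes become free for a new pipeline
    }
  }
}
{code}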

While this algorithm works well on a big cluster, it doesn't provide very good 
usability on small clusters:

Use case 1:

Given 3 nodes, in case of a service restart, if the restart takes more than 
90s, the pipeline will be moved to the CLOSING state. For the next 5 minutes 
(ozone.scm.pipeline.destroy.timeout) the pipeline will remain in the CLOSING 
state. As there are no more nodes and we can't assign the same node to two 
different pipelines, the cluster will be unavailable for 5 minutes.

Use case 2:

Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3 
racks. Let's stop one rack. As all the pipelines are affected, all of them will 
be moved to the CLOSING state. We have no free nodes, therefore we need to wait 
5 minutes before we can write any data to the cluster.

These problems can be solved in multiple ways:

1.) Instead of waiting 5 minutes, destroy the pipeline as soon as all the 
containers are reported to be closed (see the sketch after this list). Most of 
the time this is enough, but some container reports can be missing.
2.) Support multi-raft and open a new pipeline as soon as we have enough nodes 
(even if the nodes are still part of a CLOSING pipeline).
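
A minimal sketch of option 1, assuming SCM tracks the set of containers per 
pipeline and learns about container closes via container reports. The class and 
method names below are hypothetical, not the actual SCM API.

{code:java}
// Hypothetical helper illustrating option 1: destroy a pipeline as soon as
// every container on it is confirmed closed, instead of waiting for
// ozone.scm.pipeline.destroy.timeout. Names are illustrative only.
import java.util.HashSet;
import java.util.Set;

public class EagerPipelineDestroyer {

  private final Set<Long> openContainers = new HashSet<>();
  private boolean destroyed = false;

  EagerPipelineDestroyer(Set<Long> containersOnPipeline) {
    openContainers.addAll(containersOnPipeline);
  }

  /** Invoked when a container report confirms a container is closed. */
  void onContainerClosed(long containerId) {
    openContainers.remove(containerId);
    // Caveat from the description: if a report is lost, the set never
    // drains, so the timeout is still needed as a fallback.
    if (openContainers.isEmpty() && !destroyed) {
      destroyed = true;
      destroyPipeline();
    }
  }

  private void destroyPipeline() {
    // Placeholder for the real pipeline teardown in SCM.
    System.out.println("All containers closed, destroying pipeline early");
  }
}
{code}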

Both options require more work on the pipeline management side. For 0.4.0 we 
can adjust the following parameters to get a better user experience:

{code}
  <property>
    <name>ozone.scm.pipeline.destroy.timeout</name>
    <value>60s</value>
    <tag>OZONE, SCM, PIPELINE</tag>
    <description>
      Once a pipeline is closed, SCM should wait for the above configured time
      before destroying a pipeline.
    </description>
  </property>

  <property>
    <name>ozone.scm.stale.node.interval</name>
    <value>90s</value>
    <tag>OZONE, MANAGEMENT</tag>
    <description>
      The interval for stale node flagging. Please
      see ozone.scm.heartbeat.thread.interval before changing this value.
    </description>
  </property>
{code}

First of all, we can be more optimistic and mark a node as stale only after 5 
minutes instead of 90s. 5 minutes should be enough most of the time to recover 
the nodes.

Second: we can decrease ozone.scm.pipeline.destroy.timeout. Ideally the close 
command is sent by the SCM to the datanode with a HB. Between two HBs we have 
enough time to close all the containers via Ratis. With the next HB, the 
datanode can report the successful close. (If the containers can't be closed, 
the SCM can manage the QUASI_CLOSED containers.)

We need to wait 29 seconds (worst case) for the next HB, and 29+30 seconds for 
the confirmation. --> 66 seconds seems to be a safe choice (assuming that 6 
seconds is enough to process the report about the successful closing)
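
The worst-case timeline, written out as a quick check. The ~30 second heartbeat 
interval used below is an assumption for this sketch; it is configured 
separately from the properties above.

{code:java}
// Back-of-the-envelope check of the proposed 66s destroy timeout, assuming a
// ~30 second heartbeat interval (an assumption for this sketch).
public class DestroyTimeoutEstimate {
  public static void main(String[] args) {
    int hbIntervalSec = 30;                // assumed heartbeat interval
    int waitForCloseCmd = hbIntervalSec;   // worst case until the close command
                                           // is piggybacked on the next HB
    int waitForConfirm = hbIntervalSec;    // one more HB until the datanode
                                           // reports the successful close
    int processingMargin = 6;              // time for SCM to process the report

    int total = waitForCloseCmd + waitForConfirm + processingMargin;
    System.out.println("Safe destroy timeout ~= " + total + "s"); // ~66s
  }
}
{code}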


