[ https://issues.apache.org/jira/browse/HDDS-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793963#comment-16793963 ]
Ajay Kumar commented on HDDS-1284:
----------------------------------
+1, [~elek] thanks for the contribution. Committed to trunk and 0.4.
> Adjust default values of pipeline recovery for more resilient service restart
> -----------------------------------------------------------------------------
>
> Key: HDDS-1284
> URL: https://issues.apache.org/jira/browse/HDDS-1284
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Elek, Marton
> Assignee: Elek, Marton
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> As of now we have the following algorithm to handle node failures (a sketch
> in code follows the list):
> 1. In case of a missing node, the leader of the pipeline or the SCM can
> detect the missing heartbeats.
> 2. SCM starts to close the pipeline (CLOSING state) and tries to close the
> containers with the remaining nodes in the pipeline.
> 3. After 5 minutes the pipeline is destroyed (CLOSED) and a new pipeline
> can be created from the healthy nodes (one node can be part of only one
> pipeline at a time).
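> To make the flow concrete, here is a minimal, self-contained sketch of these
> state transitions (class and member names are illustrative, not the actual
> SCM pipeline manager code):
> {code}
> import java.time.Duration;
> import java.time.Instant;
>
> // Illustrative model of the recovery flow described above.
> public class PipelineRecoverySketch {
>   enum State { OPEN, CLOSING, CLOSED }
>
>   static final Duration STALE_NODE_INTERVAL = Duration.ofSeconds(90);
>   static final Duration DESTROY_TIMEOUT = Duration.ofMinutes(5);
>
>   State state = State.OPEN;
>   Instant closingSince;
>
>   // Steps 1-2: missed heartbeats move the pipeline to CLOSING; SCM then
>   // starts closing the containers on the remaining nodes.
>   void onHeartbeatSilence(Duration silence, Instant now) {
>     if (state == State.OPEN && silence.compareTo(STALE_NODE_INTERVAL) >= 0) {
>       state = State.CLOSING;
>       closingSince = now;
>     }
>   }
>
>   // Step 3: only after the destroy timeout is the pipeline CLOSED, which
>   // frees its nodes for a new pipeline.
>   void checkDestroyTimeout(Instant now) {
>     if (state == State.CLOSING
>         && Duration.between(closingSince, now).compareTo(DESTROY_TIMEOUT) >= 0) {
>       state = State.CLOSED;
>     }
>   }
> }
> {code}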
> While this algorithm can work well on a big cluster, it provides poor
> usability on small clusters:
> Use case 1:
> Given 3 nodes, in case of a service restart, if the restart takes more than
> 90s, the pipeline will be moved to the CLOSING state. For the next 5 minutes
> (ozone.scm.pipeline.destroy.timeout) the pipeline will remain in the CLOSING
> state. As there are no more nodes and we can't assign the same node to two
> different pipelines, the cluster will be unavailable for 5 minutes.
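> With the current defaults (90s stale interval, 5 minute destroy timeout) the
> worst-case timeline of use case 1 looks roughly like this:
> {code}
> t=0s         all 3 datanodes restart
> t=90s        heartbeats are missed; pipeline moves OPEN -> CLOSING
> t=90s..390s  pipeline stays CLOSING (ozone.scm.pipeline.destroy.timeout);
>              no free nodes, so no new pipeline and no writes
> t=390s       pipeline is CLOSED; a new pipeline can be created
> {code}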
> Use case 2:
> Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3
> racks, let's stop one rack. As all the pipelines are affected, all of them
> will be moved to the CLOSING state. We have no free nodes, therefore we need
> to wait 5 minutes before any data can be written to the cluster.
> These problems can be solved in multiple ways:
> 1.) Instead of waiting 5 minutes, destroy the pipeline when all the
> containers are reported to be closed. (Most of the time this is enough, but
> some container reports can be missing.)
> 2.) Support multi-raft and open a pipeline as soon as we have enough nodes
> (even if the nodes already have CLOSING pipelines).
> Both options require more work on the pipeline management side. For 0.4.0
> we can adjust the following parameters to get a better user experience:
> {code}
> <property>
>   <name>ozone.scm.pipeline.destroy.timeout</name>
>   <value>60s</value>
>   <tag>OZONE, SCM, PIPELINE</tag>
>   <description>
>     Once a pipeline is closed, SCM should wait for the above configured time
>     before destroying a pipeline.
>   </description>
> </property>
> <property>
>   <name>ozone.scm.stale.node.interval</name>
>   <value>90s</value>
>   <tag>OZONE, MANAGEMENT</tag>
>   <description>
>     The interval for stale node flagging. Please
>     see ozone.scm.heartbeat.thread.interval before changing this value.
>   </description>
> </property>
> {code}
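> For reference, duration-typed properties like these can be read with the
> Hadoop Configuration API, which parses suffixed values such as "60s" or
> "5m". A minimal sketch (assuming ozone-site.xml is on the classpath; the
> fallback values here are illustrative, not the shipped defaults):
> {code}
> import java.util.concurrent.TimeUnit;
> import org.apache.hadoop.conf.Configuration;
>
> public class ReadPipelineTimeouts {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     conf.addResource("ozone-site.xml"); // picks up the properties above
>
>     long destroyTimeoutMs = conf.getTimeDuration(
>         "ozone.scm.pipeline.destroy.timeout", 300_000L, TimeUnit.MILLISECONDS);
>     long staleIntervalMs = conf.getTimeDuration(
>         "ozone.scm.stale.node.interval", 90_000L, TimeUnit.MILLISECONDS);
>
>     System.out.println("destroy timeout = " + destroyTimeoutMs + " ms");
>     System.out.println("stale interval  = " + staleIntervalMs + " ms");
>   }
> }
> {code}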
> First of all, we can be more optimistic and mark a node stale only after 5
> minutes instead of 90s. 5 minutes should be enough most of the time to
> recover the nodes.
> Second: we can decrease ozone.scm.pipeline.destroy.timeout.
> Ideally the close command is sent by the SCM to the datanode with a HB.
> Between two HBs we have enough time to close all the containers via Ratis.
> With the next HB, the datanode can report the successful close. (If the
> containers can't be closed, the SCM can manage the QUASI_CLOSED containers.)
> We need to wait 29 seconds (worst case) for the next HB, and 29+30 seconds
> for the confirmation, so 66 seconds seems to be a safe choice (assuming that
> 6 seconds is enough to process the report about the successful close).
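> The same worst-case budget, spelled out in code (a sketch that assumes the
> 30s heartbeat interval implied above):
> {code}
> // Worst-case budget behind the proposed 66s destroy timeout.
> public class DestroyTimeoutBudget {
>   public static void main(String[] args) {
>     int hb = 30;                        // heartbeat interval in seconds
>     int closeCommandSent = hb - 1;      // 29s: next HB delivers the close command
>     int closeConfirmed = closeCommandSent + hb; // 59s: following HB reports the close
>     int reportProcessing = 6;           // budget to process the close report
>     int total = closeConfirmed + reportProcessing; // 65s, so 66s is safe
>     System.out.println("worst case = " + total + "s (timeout: 66s)");
>   }
> }
> {code}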