HDFS-11035. Better documentation for maintenace mode and upgrade domain.

Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/ce943eb1
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/ce943eb1
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/ce943eb1

Branch: refs/heads/YARN-5734
Commit: ce943eb17a4218d8ac1f5293c6726122371d8442
Parents: 230b85d
Author: Ming Ma <min...@twitter.com>
Authored: Wed Sep 20 09:36:33 2017 -0700
Committer: Ming Ma <min...@twitter.com>
Committed: Wed Sep 20 09:36:33 2017 -0700

----------------------------------------------------------------------
 .../src/site/markdown/HdfsDataNodeAdminGuide.md | 165 ++++++++++++++++++
 .../src/site/markdown/HdfsUpgradeDomain.md      | 167 +++++++++++++++++++
 hadoop-project/src/site/site.xml                |   4 +-
 3 files changed, 335 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/ce943eb1/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDataNodeAdminGuide.md
----------------------------------------------------------------------
diff --git 
a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDataNodeAdminGuide.md 
b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDataNodeAdminGuide.md
new file mode 100644
index 0000000..d6f288e
--- /dev/null
+++ 
b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDataNodeAdminGuide.md
@@ -0,0 +1,165 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+HDFS DataNode Admin Guide
+=================
+
+<!-- MACRO{toc|fromDepth=0|toDepth=3} -->
+
+Overview
+--------
+
+The Hadoop Distributed File System (HDFS) namenode maintains the state of all datanodes.
+There are two types of state. The first type describes the liveness of a datanode, indicating whether
+the node is live, dead or stale. The second type describes the admin state, indicating whether the node
+is in service, decommissioned or under maintenance.
+
+When an administrator decommissions a datanode, the datanode is first transitioned into the
+`DECOMMISSION_INPROGRESS` state. After all blocks belonging to that datanode have been fully replicated elsewhere
+based on each block's replication factor, the datanode is transitioned to the `DECOMMISSIONED` state. After that,
+the administrator can shut down the node to perform long-term repair and maintenance that could take days or weeks.
+After the machine has been repaired, it can be recommissioned back to the cluster.
+
+Sometimes administrators only need to take datanodes down for minutes or hours to perform short-term repair or maintenance.
+In such scenarios, the HDFS block replication overhead incurred by decommission might not be necessary and a lightweight process is desirable.
+That is what the maintenance state is for. When an administrator puts a datanode in maintenance state, the datanode is first transitioned
+to the `ENTERING_MAINTENANCE` state. As long as all blocks belonging to that datanode are minimally replicated elsewhere, the datanode
+is immediately transitioned to the `IN_MAINTENANCE` state. After the maintenance has completed, the administrator can take the datanode
+out of the maintenance state. In addition, the maintenance state supports a timeout that lets administrators configure the maximum duration
+a datanode is allowed to stay in maintenance state. After the timeout, the datanode is transitioned out of maintenance state
+automatically by HDFS without human intervention.
+
+In summary, datanode admin operations include the following:
+
+* Decommission
+* Recommission
+* Putting nodes in maintenance state
+* Taking nodes out of maintenance state
+
+And datanode admin states include the following:
+
+* `NORMAL` The node is in service.
+* `DECOMMISSIONED` The node has been decommissioned.
+* `DECOMMISSION_INPROGRESS` The node is being transitioned to the `DECOMMISSIONED` state.
+* `IN_MAINTENANCE` The node is in maintenance state.
+* `ENTERING_MAINTENANCE` The node is being transitioned to maintenance state.
+
+
+Host-level settings
+-----------
+
+Performing any datanode admin operation takes two steps.
+
+* Update host-level configuration files to indicate the desired admin states 
of targeted datanodes. There are two supported formats for configuration files.
+    * Hostname-only configuration. Each line includes the hostname or IP address of a datanode. This is the default format.
+    * JSON-based configuration. The configuration is in JSON format. Each element maps to one datanode and each datanode can have multiple properties. This format is required to put datanodes into maintenance state.
+
+* Run the following command to have the namenode reload the host-level configuration files.
+`hdfs dfsadmin -refreshNodes`
+
+### Hostname-only configuration
+This is the default configuration used by the namenode. It only supports node 
decommission and recommission; it doesn't support admin operations related to 
maintenance state. Use `dfs.hosts` and `dfs.hosts.exclude` as explained in 
[hdfs-default.xml](./hdfs-default.xml).
+
+In the following example, `host1` and `host2` need to be in service.
+`host3` and `host4` need to be in decommissioned state.
+
+dfs.hosts file
+```text
+host1
+host2
+host3
+host4
+```
+dfs.hosts.exclude file
+```text
+host3
+host4
+```
+
+### JSON-based configuration
+
+JSON-based format is the new configuration format that supports generic 
properties on datanodes. Set the following
+configurations to enable JSON-based format as explained in 
[hdfs-default.xml](./hdfs-default.xml).
+
+
+| Setting | Value |
+|:---- |:---- |
+|`dfs.namenode.hosts.provider.classname`| `org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager`|
+|`dfs.hosts`| the path of the json hosts file |
+
+Here is the list of properties currently supported by HDFS.
+
+
+| Property | Description |
+|:---- |:---- |
+|`hostName`| Required. The host name of the datanode. |
+|`upgradeDomain`| Optional. The upgrade domain id of the datanode. |
+|`adminState`| Optional. The expected admin state. The default value is 
`NORMAL`; `DECOMMISSIONED` for decommission; `IN_MAINTENANCE` for maintenance 
state. |
+|`port`| Optional. The port number of the datanode. |
+|`maintenanceExpireTimeInMS`| Optional. The epoch time in milliseconds until 
which the datanode will remain in maintenance state. The default value is 
forever. |
+
+In the following example, `host1` and `host2` need to be in service. `host3` needs to be in decommissioned state. `host4` needs to be in maintenance state.
+
+dfs.hosts file
+```json
+[
+  {
+    "hostName": "host1"
+  },
+  {
+    "hostName": "host2",
+    "upgradeDomain": "ud0"
+  },
+  {
+    "hostName": "host3",
+    "adminState": "DECOMMISSIONED"
+  },
+  {
+    "hostName": "host4",
+    "upgradeDomain": "ud2",
+    "adminState": "IN_MAINTENANCE"
+  }
+]
+```
+
+
+Cluster-level settings
+-----------
+
+There are several cluster-level settings related to datanode administration.
+For common use cases, you should rely on the default values. Please refer to
+[hdfs-default.xml](./hdfs-default.xml) for descriptions and default values.
+
+```text
+dfs.namenode.maintenance.replication.min
+dfs.namenode.decommission.interval
+dfs.namenode.decommission.blocks.per.interval
+dfs.namenode.decommission.max.concurrent.tracked.nodes
+```
+
+Metrics
+-----------
+
+Admin states are part of the namenode's webUI and JMX. As explained in 
[HDFSCommands.html](./HDFSCommands.html), you can also verify admin states 
using the following commands.
+
+Use `dfsadmin` to check admin states at the cluster level.
+
+`hdfs dfsadmin -report`
+
+Use `fsck` to check admin states of datanodes storing data at a specific path. 
For backward compatibility, a special flag is required to return maintenance 
states.
+
+```text
+hdfs fsck <path> // only show decommission state
+hdfs fsck <path> -maintenance // include maintenance state
+```
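
As an illustration of the JSON-based hosts file described in the DataNode admin guide above, the following Python sketch builds entries for in-service, decommissioned and in-maintenance datanodes and computes `maintenanceExpireTimeInMS` as an epoch time in milliseconds. The function `build_hosts_entries` and its parameters are hypothetical and not part of HDFS, and the sketch assumes the expiry property is written as a JSON number. It only prints the JSON; writing it to the path configured in `dfs.hosts` and refreshing the namenode is left to the operator.

```python
import json
import time


def build_hosts_entries(in_service, decommissioned, maintenance,
                        maintenance_hours, now_ms=None):
    """Build a list of host entries using the property names from the guide.

    Hypothetical helper for illustration only; it is not part of HDFS.
    """
    if now_ms is None:
        # Current time as epoch milliseconds.
        now_ms = int(time.time() * 1000)
    # Maintenance ends maintenance_hours after now_ms.
    expire_ms = now_ms + int(maintenance_hours * 3600 * 1000)

    # In-service nodes rely on the default adminState (NORMAL).
    entries = [{"hostName": host} for host in in_service]
    entries += [
        {"hostName": host, "adminState": "DECOMMISSIONED"}
        for host in decommissioned
    ]
    entries += [
        {
            "hostName": host,
            "adminState": "IN_MAINTENANCE",
            # Assumption: the value is an epoch time in ms, as a JSON number.
            "maintenanceExpireTimeInMS": expire_ms,
        }
        for host in maintenance
    ]
    return entries


if __name__ == "__main__":
    hosts = build_hosts_entries(
        in_service=["host1", "host2"],
        decommissioned=["host3"],
        maintenance=["host4"],
        maintenance_hours=2,
    )
    print(json.dumps(hosts, indent=2))
```

Passing `now_ms` explicitly makes the expiry reproducible, which can be useful when the same file has to be generated identically for several namenodes.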

http://git-wip-us.apache.org/repos/asf/hadoop/blob/ce943eb1/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsUpgradeDomain.md
----------------------------------------------------------------------
diff --git 
a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsUpgradeDomain.md 
b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsUpgradeDomain.md
new file mode 100644
index 0000000..15a4bae
--- /dev/null
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsUpgradeDomain.md
@@ -0,0 +1,167 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+HDFS Upgrade Domain
+====================
+
+<!-- MACRO{toc|fromDepth=0|toDepth=3} -->
+
+
+Introduction
+------------
+
+The current default HDFS block placement policy guarantees that a block’s 3 
replicas will be placed
+on at least 2 racks. Specifically one replica is placed on one rack and the 
other two replicas
+are placed on another rack during write pipeline. This is a good compromise 
between rack diversity and write-pipeline efficiency. Note that
+subsequent load balancing or machine membership change might cause 3 replicas 
of a block to be distributed
+across 3 different racks. Thus any 3 datanodes in different racks could store 
3 replicas of a block.
+
+
+However, the default placement policy impacts how we should perform datanode 
rolling upgrade.
+[HDFS Rolling Upgrade document](./HdfsRollingUpgrade.html) explains how the 
datanodes can be upgraded in a rolling
+fashion without downtime. Because any 3 datanodes in different racks could 
store all the replicas of a block, it is
+important to perform sequential restart of datanodes one at a time in order to 
minimize the impact on data availability
+and read/write operations. Upgrading one rack at a time is another option, but that will increase the chance of
+data unavailability if a machine fails on another rack during the upgrade.
+
+The side effect of this sequential datanode rolling upgrade strategy is longer
+upgrade duration for larger clusters.
+
+
+Architecture
+-------
+
+To address the limitation of block placement policy on rolling upgrade, the 
concept of upgrade domain
+has been added to HDFS via a new block placement policy. The idea is to group 
datanodes in a new
+dimension called upgrade domain, in addition to the existing rack-based 
grouping.
+For example, we can assign all datanodes in the first position of any rack to 
upgrade domain ud_01,
+nodes in the second position to upgrade domain ud_02 and so on.
+
+The namenode provides the `BlockPlacementPolicy` interface to support custom block placement besides
+the default block placement policy. A new upgrade domain block placement policy based on this interface
+is available in HDFS. It makes sure replicas of any given block are distributed across machines from different upgrade domains.
+By default, 3 replicas of any given block are placed on 3 different upgrade domains. This means all datanodes belonging to
+a specific upgrade domain collectively won't store more than one replica of any block.
+
+With the upgrade domain block placement policy in place, we can upgrade all datanodes belonging to one upgrade domain at the
+same time without impacting data availability. Only after one upgrade domain has finished upgrading do we move to the next
+upgrade domain, until all upgrade domains have been upgraded. This procedure ensures that no two replicas of any given
+block are upgraded at the same time. This means we can upgrade many machines at the same time for a large cluster.
+And as the cluster continues to scale, new machines will be added to the existing upgrade domains without impacting the
+parallelism of the upgrade.
+
+For an existing cluster with the default block placement policy, after switching to the new upgrade domain block
+placement policy, any newly created blocks will conform to the new policy. The old blocks allocated based on the old policy
+need to be migrated to the new policy. There is a migrator tool you can use. See HDFS-8789 for details.
+
+
+Settings
+-------
+
+To enable upgrade domain on your clusters, please follow these steps:
+
+* Assign datanodes to individual upgrade domain groups.
+* Enable upgrade domain block placement policy.
+* Migrate blocks allocated based on old block placement policy to the new 
upgrade domain policy.
+
+### Upgrade domain id assignment
+
+How a datanode maps to an upgrade domain id is defined by administrators and 
specific to the cluster layout.
+A common way is to use the rack position of the machine as its upgrade domain id.
+
+To configure the mapping from host name to upgrade domain id, we need to use the JSON-based host configuration file
+by setting the following properties as explained in [hdfs-default.xml](./hdfs-default.xml).
+
+| Setting | Value |
+|:---- |:---- |
+|`dfs.namenode.hosts.provider.classname` | `org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager`|
+|`dfs.hosts`| the path of the json hosts file |
+
+The json hosts file defines the property for all hosts. In the following 
example,
+there are 4 datanodes in 2 racks; the machines at rack position 01 belong to 
upgrade domain 01;
+the machines at rack position 02 belong to upgrade domain 02.
+
+```json
+[
+  {
+    "hostName": "dcA-rackA-01",
+    "upgradeDomain": "01"
+  },
+  {
+    "hostName": "dcA-rackA-02",
+    "upgradeDomain": "02"
+  },
+  {
+    "hostName": "dcA-rackB-01",
+    "upgradeDomain": "01"
+  },
+  {
+    "hostName": "dcA-rackB-02",
+    "upgradeDomain": "02"
+  }
+]
+```
+
+
+### Enable upgrade domain block placement policy
+
+After each datanode has been assigned an upgrade domain id, the next step is 
to enable
+upgrade domain block placement policy with the following configuration as 
explained in [hdfs-default.xml](./hdfs-default.xml).
+
+| Setting | Value |
+|:---- |:---- |
+|`dfs.block.replicator.classname`| `org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain` |
+
+After the namenode is restarted, the new policy will be used for any new block allocation.
+
+
+### Migration
+
+If you change the block placement policy of an existing cluster, you will need to make sure the
+blocks allocated prior to the block placement policy change conform to the new block placement policy.
+
+HDFS-8789 provides the initial draft patch of a client-side migration tool. 
After the tool is committed,
+we will be able to describe how to use the tool.
+
+
+Rolling restart based on upgrade domains
+-------
+
+During cluster administration, we might need to restart datanodes to pick up a new configuration, a new Hadoop release,
+a new JVM version and so on. With upgrade domains enabled and all blocks on the cluster conforming to the new policy, we can now
+restart datanodes in batches, one upgrade domain at a time. Whether it is a manual process or automated, the steps are:
+
+* Group datanodes by upgrade domains based on dfsadmin or JMX's datanode 
information.
+* For each upgrade domain
+    * (Optional) Put all the nodes in that upgrade domain into maintenance state (refer to [HdfsDataNodeAdminGuide.html](./HdfsDataNodeAdminGuide.html)).
+    * Restart all those nodes.
+    * Check if all datanodes are healthy after restart. Unhealthy nodes should 
be decommissioned.
+    * (Optional) Take all those nodes out of maintenance state.
+
+
+Metrics
+-----------
+
+Upgrade domains are part of namenode's JMX. As explained in 
[HDFSCommands.html](./HDFSCommands.html), you can also verify upgrade domains 
using the following commands.
+
+Use `dfsadmin` to check upgrade domains at the cluster level.
+
+`hdfs dfsadmin -report`
+
+Use `fsck` to check upgrade domains of datanodes storing data at a specific 
path.
+
+`hdfs fsck <path> -files -blocks -upgradedomains`
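
The first step of the rolling restart procedure above, grouping datanodes by upgrade domain, might be sketched in Python as follows. This sketch reads the JSON hosts file rather than `dfsadmin` or JMX output, an assumption made to keep it self-contained, and the helper names `group_by_upgrade_domain` and `restart_batches` are hypothetical.

```python
import json
from collections import defaultdict


def group_by_upgrade_domain(hosts_json):
    """Map each upgrade domain id to the host names assigned to it."""
    groups = defaultdict(list)
    for entry in json.loads(hosts_json):
        # Hosts without an upgradeDomain are grouped under None so that
        # they are visible to the operator instead of silently dropped.
        groups[entry.get("upgradeDomain")].append(entry["hostName"])
    return dict(groups)


def restart_batches(groups):
    """Return (upgrade_domain, hosts) pairs, one batch per upgrade domain."""
    return [(ud, sorted(groups[ud])) for ud in sorted(groups, key=str)]


if __name__ == "__main__":
    # Same layout as the example hosts file: 4 datanodes in 2 racks.
    sample = json.dumps([
        {"hostName": "dcA-rackA-01", "upgradeDomain": "01"},
        {"hostName": "dcA-rackA-02", "upgradeDomain": "02"},
        {"hostName": "dcA-rackB-01", "upgradeDomain": "01"},
        {"hostName": "dcA-rackB-02", "upgradeDomain": "02"},
    ])
    for upgrade_domain, hosts in restart_batches(group_by_upgrade_domain(sample)):
        print(upgrade_domain, hosts)
```

Each batch holds the nodes of exactly one upgrade domain, so restarting one batch at a time follows the per-domain procedure listed above. The restart and health check themselves are outside the scope of this sketch.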

http://git-wip-us.apache.org/repos/asf/hadoop/blob/ce943eb1/hadoop-project/src/site/site.xml
----------------------------------------------------------------------
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index 4685e2a..a88f0e3 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -101,7 +101,9 @@
       <item name="Synthetic Load Generator" 
href="hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html"/>
       <item name="Erasure Coding" 
href="hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html"/>
       <item name="Disk Balancer" 
href="hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html"/>
-   </menu>
+      <item name="Upgrade Domain" 
href="hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html"/>
+      <item name="DataNode Admin" 
href="hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html"/>
+    </menu>
 
     <menu name="MapReduce" inherit="top">
       <item name="Tutorial" 
href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html"/>

