[ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=682317&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-682317
 ]

ASF GitHub Bot logged work on HDFS-16303:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Nov/21 23:18
            Start Date: 16/Nov/21 23:18
    Worklog Time Spent: 10m 
      Work Description: KevinWikant opened a new pull request #3667:
URL: https://github.com/apache/hadoop/pull/3667


   ### Description of PR
   
   Fixes a bug in Hadoop HDFS where, if more than 
"dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost 
while in state decommissioning, all forward progress towards decommissioning 
any datanodes (including healthy datanodes) is blocked.
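
   A minimal sketch of the idea (hypothetical names such as 
`DecommissionTrackerSketch` and `Node`, not the real `DatanodeAdminManager` 
code; the actual change may differ in detail): when the tracked set is at the 
configured limit, datanodes that are dead while in Decommission In Progress 
are pushed back onto the pending queue so that healthy datanodes can take 
their tracking slots.

   ```
   // Hypothetical, simplified model of the re-queue behaviour described above.
   import java.util.ArrayDeque;
   import java.util.HashSet;
   import java.util.Iterator;
   import java.util.Queue;
   import java.util.Set;

   class DecommissionTrackerSketch {

     /** Minimal stand-in for a datanode that is being decommissioned. */
     static final class Node {
       final String id;
       boolean alive;
       boolean decommissioned;
       Node(String id, boolean alive) { this.id = id; this.alive = alive; }
     }

     // dfs.namenode.decommission.max.concurrent.tracked.nodes
     private final int maxTracked;
     private final Set<Node> tracked = new HashSet<>();       // actively tracked
     private final Queue<Node> pending = new ArrayDeque<>();  // waiting for a slot

     DecommissionTrackerSketch(int maxTracked) { this.maxTracked = maxTracked; }

     void startDecommission(Node n) { pending.add(n); }

     /** One monitor tick. */
     void tick() {
       // 1. Nodes that finished decommissioning leave the tracked set.
       tracked.removeIf(n -> n.decommissioned);

       // 2. Key change: if the tracked set is still full, stop tracking nodes
       //    that are dead while in Decommission In Progress and re-queue them,
       //    so they no longer starve healthy nodes waiting in the queue.
       if (tracked.size() >= maxTracked) {
         Iterator<Node> it = tracked.iterator();
         while (it.hasNext()) {
           Node n = it.next();
           if (!n.alive && !n.decommissioned) {
             it.remove();
             pending.add(n);  // goes to the back; retried if capacity allows
           }
         }
       }

       // 3. Fill the freed slots from the pending queue, as before.
       while (tracked.size() < maxTracked && !pending.isEmpty()) {
         tracked.add(pending.poll());
       }
     }
   }
   ```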
   
   ### How was this patch tested?
   
   #### Unit Testing
   
   Added new unit tests:
   - TestDecommission.testRequeueUnhealthyDecommissioningNodes
   - DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
   - DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering
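
   The two queue-ordering tests above concern how the pending-nodes queue 
orders its entries. As a hedged, hypothetical sketch only (the comparator 
actually used by the patch may differ), this is one way such a queue could 
prioritize live datanodes over dead, re-queued ones:

   ```
   import java.util.Comparator;
   import java.util.PriorityQueue;

   class PendingQueueOrderingSketch {

     /** Minimal stand-in for a datanode waiting in the pending queue. */
     static final class PendingNode {
       final String id;
       final boolean alive;
       PendingNode(String id, boolean alive) { this.id = id; this.alive = alive; }
       boolean isAlive() { return alive; }
       @Override public String toString() { return id; }
     }

     public static void main(String[] args) {
       // Order live nodes ahead of dead ones (Boolean: false < true, so reverse).
       Comparator<PendingNode> liveFirst =
           Comparator.comparing(PendingNode::isAlive).reversed();

       PriorityQueue<PendingNode> pending = new PriorityQueue<>(liveFirst);
       pending.add(new PendingNode("dn-dead-1", false));
       pending.add(new PendingNode("dn-live-1", true));
       pending.add(new PendingNode("dn-dead-2", false));
       pending.add(new PendingNode("dn-live-2", true));

       // Live nodes are polled first, so re-queued dead nodes cannot starve
       // healthy nodes that are waiting for a tracking slot.
       while (!pending.isEmpty()) {
         System.out.println(pending.poll());  // dn-live-*, then dn-dead-*
       }
     }
   }
   ```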
   
   All "TestDecommission" & "DatanodeAdminMonitorBase" tests pass when run 
locally
   
   Note that, without the "DatanodeAdminManager" changes, the new test 
"testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting 
for the healthy nodes to be decommissioned:
   
   ```
   > mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
   ...
   [ERROR] Errors: 
   [ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1772 » Timeout Timed...
   ```
   
   #### Manual Testing
   
   - create Hadoop cluster with:
       - 30 datanodes initially
       - hdfs-site configuration 
"dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"
       - custom Namenode JAR containing this change
   
   ```
   > cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
       <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
       <value>10</value>
   ```
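
   For reference, the grep output above comes from a property entry along 
these lines in hdfs-site.xml:

   ```
   <property>
       <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
       <value>10</value>
   </property>
   ```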
   
   - reproduce the bug: https://issues.apache.org/jira/browse/HDFS-16303
       - start decommissioning over 20 datanodes
       - terminate 20 datanodes while decommissioning
       - observe the Namenode logs to validate that there are 20 unhealthy 
datanodes stuck in "Decommission In Progress"
   
   ```
   2021-11-15 17:57:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:57:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:58:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:58:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:59:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:59:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   - scale up to 25 healthy datanodes & then decommission 22 of those datanodes 
(all but 3)
       - observe the Namenode logs to validate that those 22 healthy datanodes 
are decommissioned (i.e. HDFS-16303 is solved)
   
   ```
   2021-11-15 17:59:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:59:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:00:14,487 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes 
will be tracked at a time. 32 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:00:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes 
will be tracked at a time. 32 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:01:14,486 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes 
will be tracked at a time. 32 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:01:44,486 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes 
will be tracked at a time. 22 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:02:14,486 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes 
will be tracked at a time. 22 nodes are currently queued waiting to be 
decommissioned.
   
   2021-11-15 18:02:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 12 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:02:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 8 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:03:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:03:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   ### For code changes:
   
   - [yes] Does the title of this PR start with the corresponding JIRA issue 
id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [no] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [no] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 682317)
    Remaining Estimate: 0h
            Time Spent: 10m

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16303
>                 URL: https://issues.apache.org/jira/browse/HDFS-16303
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.10.1, 3.3.1
>            Reporter: Kevin Wikant
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For 
> example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X 
> of those datanodes remain in state decommissioning forever without making any 
> forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for 
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
>  Default Value = 100
> The maximum number of decommission-in-progress datanodes that will be 
> tracked at one time by the namenode. Tracking a decommission-in-progress 
> datanode consumes additional NN memory proportional to the number of blocks 
> on the datanode. Having a conservative limit reduces the potential impact of 
> decommissioning a large number of nodes at once. A value of 0 means no limit 
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning 
> at any given time, so as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
>  * a datanode is only removed from the "tracked.nodes" set when it 
> finishes decommissioning
>  * a datanode is only added to the "tracked.nodes" set if there are fewer 
> than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned 
> at a given time, some of those datanodes will not be in the "tracked.nodes" 
> set until 1 or more datanodes in the "tracked.nodes" set finish 
> decommissioning. This is generally not a problem because the datanodes in 
> "tracked.nodes" will eventually finish decommissioning, but there is an edge 
> case where this logic prevents the namenode from making any forward progress 
> towards decommissioning.
> If all 100 datanodes in the "tracked.nodes" are unable to finish 
> decommissioning, then other datanodes (which may be able to be 
> decommissioned) will never get added to "tracked.nodes" and therefore will 
> never get the opportunity to be decommissioned.
> This can occur due to the following issue:
> {quote}2021-10-21 12:39:24,048 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
> Progress. Cannot be safely decommissioned or be in maintenance since there is 
> risk of reduced data durability or data loss. Either restart the failed node 
> or force decommissioning or maintenance by removing, calling refreshNodes, 
> then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying 
> hardware fails or is lost), then it will remain in state decommissioning 
> forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop 
> cluster lifetime, then this is enough to completely fill up the 
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with 
> datanodes that can never finish decommissioning, any datanodes added after 
> this point will never be able to be decommissioned because they will never be 
> added to the "tracked.nodes" set.
> In this scenario:
>  * the "tracked.nodes" set is filled with datanodes which are lost & cannot 
> be recovered (and can never finish decommissioning so they will never be 
> removed from the set)
>  * the actual live datanodes being decommissioned are enqueued waiting to 
> enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will 
> be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance 
> by removing, calling refreshNodes, then re-adding to the excludes or host 
> config files.
> {quote}
> Ideally, the Namenode should be able to gracefully handle scenarios where the 
> datanodes in the "tracked.nodes" set are not making forward progress towards 
> decommissioning while the enqueued datanodes may be able to make forward 
> progress.
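> A minimal, hypothetical sketch of the pre-fix behaviour described above 
> (simplified pseudo-code for illustration only, not the real 
> DatanodeAdminManager code):
> {code:java}
> // tracked: up to maxTracked datanodes actively being decommissioned
> // pending: datanodes queued, waiting for a tracking slot
> void monitorTick() {
>   // A node leaves the tracked set only when it finishes decommissioning...
>   tracked.removeIf(node -> node.isDecommissioned());
>   // ...and a pending node is admitted only when a slot is free. If every
>   // tracked node is dead while in Decommission In Progress, none of them
>   // ever finishes, no slot ever frees up, and the healthy nodes sitting in
>   // the pending queue wait forever.
>   while (tracked.size() < maxTracked && !pending.isEmpty()) {
>     tracked.add(pending.poll());
>   }
> }
> {code}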
> h2. Reproduce Steps
>  * create a Hadoop cluster
>  * lose (i.e. terminate the host/process forever) over 100 datanodes while 
> the datanodes are in state decommissioning
>  * add additional datanodes to the cluster
>  * attempt to decommission those new datanodes & observe that they are stuck 
> in state decommissioning forever
> Note that in this example each datanode, over the full history of the 
> cluster, has a unique IP address.


