[GitHub] [hadoop] KevinWikant opened a new pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

GitBox Wed, 17 Nov 2021 14:09:54 -0800


KevinWikant opened a new pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675



   ### Description of PR
   
   Fixes a bug in Hadoop HDFS: if more than 
"dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost 
while in the decommissioning state, all forward progress towards 
decommissioning any datanode (including healthy datanodes) is blocked.
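   
   The failure mode can be illustrated with a small standalone model. This is 
only a sketch under simplifying assumptions, not the actual 
"DatanodeAdminManager" code: the class "RequeueSketch", the "Node" type and 
the "simulate" method are hypothetical names, a healthy node is assumed to 
finish decommissioning within one monitor cycle, and a dead node never 
finishes. It compares a monitor that keeps dead nodes in the tracked set with 
one that moves them back to the pending queue whenever other nodes are waiting.
   
   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Iterator;
   import java.util.List;
   import java.util.Queue;

   public class RequeueSketch {
       /** Hypothetical stand-in for a datanode; not a Hadoop class. */
       static final class Node {
           final String name;
           final boolean alive;
           Node(String name, boolean alive) {
               this.name = name;
               this.alive = alive;
           }
       }

       /**
        * Runs the given number of monitor cycles and returns how many nodes
        * finished decommissioning. Assumption: a healthy tracked node finishes
        * in one cycle, a dead tracked node never finishes.
        */
       static int simulate(List<Node> requested, int maxTracked,
                           boolean requeueDead, int cycles) {
           Queue<Node> pending = new ArrayDeque<>(requested);
           List<Node> tracked = new ArrayList<>();
           int decommissioned = 0;
           for (int c = 0; c < cycles; c++) {
               // Fill free tracking slots from the pending queue.
               while (tracked.size() < maxTracked && !pending.isEmpty()) {
                   tracked.add(pending.poll());
               }
               List<Node> requeued = new ArrayList<>();
               Iterator<Node> it = tracked.iterator();
               while (it.hasNext()) {
                   Node n = it.next();
                   if (n.alive) {
                       // Healthy node completes and frees its slot.
                       it.remove();
                       decommissioned++;
                   } else if (requeueDead && !pending.isEmpty()) {
                       // Dead node gives up its slot because others are waiting.
                       it.remove();
                       requeued.add(n);
                   }
               }
               // Re-queued dead nodes go to the back of the pending queue.
               pending.addAll(requeued);
           }
           return decommissioned;
       }

       public static void main(String[] args) {
           List<Node> nodes = new ArrayList<>();
           for (int i = 0; i < 20; i++) nodes.add(new Node("dead-" + i, false));
           for (int i = 0; i < 22; i++) nodes.add(new Node("healthy-" + i, true));
           System.out.println("without requeue: " + simulate(nodes, 10, false, 20));
           System.out.println("with requeue:    " + simulate(nodes, 10, true, 20));
       }
   }
   ```
   
   With a limit of 10 tracked nodes, 20 dead nodes queued first and 22 healthy 
nodes queued behind them, the model without re-queueing decommissions 0 nodes, 
while the model with re-queueing decommissions all 22 healthy nodes.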
   
   JIRA: https://issues.apache.org/jira/browse/HDFS-16303
   
   ### How was this patch tested?
   
   #### Unit Testing
   
   Added new unit tests:
   - TestDecommission.testRequeueUnhealthyDecommissioningNodes (& 
TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes)
   - DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
   - DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering
   
   All "TestDecommission", "TestDecommissionWithBackoffMonitor", & 
"DatanodeAdminMonitorBase" tests pass when run locally
   
   Note that without the "DatanodeAdminManager" changes the new test 
"testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting 
for the healthy nodes to be decommissioned
   
   ```
   > mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
   ...
   [ERROR] Errors: 
   [ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » 
Timeout Timed...
   ```
   
   ```
   > mvn 
-Dtest=TestDecommissionWithBackoffMonitor#testRequeueUnhealthyDecommissioningNodes
 test
   ...
   [ERROR] Errors: 
   [ERROR]   
TestDecommissionWithBackoffMonitor>TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776
 » Timeout
   ```
   
   #### Manual Testing
   
   - create Hadoop cluster with:
       - 30 datanodes initially
       - custom Namenode JAR containing this change
       - hdfs-site configuration 
"dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"
   
   ```
   > cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
       <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
       <value>10</value>
   ```
   
   - reproduce the bug: https://issues.apache.org/jira/browse/HDFS-16303
       - start decommissioning over 20 datanodes
       - terminate 20 datanodes while they are in state decommissioning
       - observe the Namenode logs to validate that there are 20 unhealthy 
datanodes stuck in "Decommission In Progress"
   
   ```
   2021-11-15 17:57:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:57:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:58:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:58:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:59:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:59:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   - scale-up to 25 healthy datanodes & then decommission 22 of those datanodes 
(all but 3)
       - observe the Namenode logs to validate those 22 healthy datanodes are 
decommissioned (i.e. HDFS-16303 is solved)
   
   ```
   2021-11-15 17:59:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 17:59:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:00:14,487 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes 
will be tracked at a time. 32 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:00:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes 
will be tracked at a time. 32 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:01:14,486 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes 
will be tracked at a time. 32 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:01:44,486 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes 
will be tracked at a time. 22 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:02:14,486 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes 
will be tracked at a time. 22 nodes are currently queued waiting to be 
decommissioned.
   
   2021-11-15 18:02:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 12 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:02:44,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 8 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:03:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes 
will be tracked at a time. 10 nodes are currently queued waiting to be 
decommissioned.
   2021-11-15 18:03:14,485 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
(DatanodeAdminMonitor-0): 
dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, 
re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   ### For code changes:
   
   - [yes] Does the title of this PR start with the corresponding JIRA issue 
id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [no] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


