KevinWikant commented on pull request #3675: URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975776386
> The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but it does not have a tracking limit I think at the moment. Perhaps it should have, as it was designed to run with less overhead than the default monitor, but perhaps if you decommissioned 100's of nodes at a time it would struggle, I am not sure.

Based on unit testing and code inspection, I think "dfs.namenode.decommission.max.concurrent.tracked.nodes" still applies to DatanodeAdminBackoffMonitor.

From the code, DatanodeAdminBackoffMonitor:
- has an [additional data structure pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L98)
- note that [pendingRep is size-constrained by "dfs.namenode.decommission.backoff.monitor.pending.limit"](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml)
- will [process blocks from within pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L293)
- then will [move blocks from outOfServiceBlocks to pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L296) so that they can be processed in the next cycle

So:
- [pendingRep gets its blocks from outOfServiceBlocks](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L492)
- [outOfServiceBlocks gets its datanodes from the pendingQueue](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L226)
- note how outOfServiceBlocks is still size-constrained by "dfs.namenode.decommission.max.concurrent.tracked.nodes" (see the sketch of this two-stage flow below)

The new unit test "TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes" will fail without the changes made to "DatanodeAdminBackoffMonitor". It fails because the unhealthy nodes have filled up the tracked set (i.e. outOfServiceBlocks) and the healthy nodes are stuck in the pendingNodes queue.
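To make the two size limits concrete, here is a minimal sketch of the flow described above. This is not the actual Hadoop implementation: aside from the two quoted configuration keys, every class, field, and method name below is a hypothetical simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/**
 * Hypothetical sketch of the two bounded hand-offs described above.
 * Not the real DatanodeAdminBackoffMonitor; only the two configuration
 * keys named in the comments are real.
 */
public class BackoffMonitorSketch {

  /** Simplified stand-in for a decommissioning datanode and its blocks. */
  static final class DecomNode {
    final String name;
    final List<String> blocks;
    DecomNode(String name, List<String> blocks) {
      this.name = name;
      this.blocks = new ArrayList<>(blocks);
    }
  }

  // Bound from "dfs.namenode.decommission.max.concurrent.tracked.nodes".
  private final int maxTrackedNodes;
  // Bound from "dfs.namenode.decommission.backoff.monitor.pending.limit".
  private final int pendingRepLimit;

  // Nodes waiting to be tracked (stand-in for the pendingNodes queue).
  final Queue<DecomNode> pendingNodes = new ArrayDeque<>();
  // Tracked nodes and their unreplicated blocks (stand-in for outOfServiceBlocks).
  final Map<String, DecomNode> outOfServiceBlocks = new LinkedHashMap<>();
  // Blocks handed off for replication this cycle (stand-in for pendingRep).
  final List<String> pendingRep = new ArrayList<>();

  BackoffMonitorSketch(int maxTrackedNodes, int pendingRepLimit) {
    this.maxTrackedNodes = maxTrackedNodes;
    this.pendingRepLimit = pendingRepLimit;
  }

  /** One monitor tick, mirroring the order described in the comment above. */
  void check() {
    // 1. Process the blocks queued on the previous cycle. In the real
    //    monitor this schedules replication work; here we just drain them.
    pendingRep.clear();

    // 2. Admit nodes from pendingNodes into the tracked set, but only while
    //    the tracked set is below max.concurrent.tracked.nodes. Unhealthy
    //    tracked nodes whose blocks never drain keep healthy nodes stuck here.
    while (outOfServiceBlocks.size() < maxTrackedNodes && !pendingNodes.isEmpty()) {
      DecomNode node = pendingNodes.poll();
      outOfServiceBlocks.put(node.name, node);
    }

    // 3. Refill pendingRep from the tracked nodes' blocks, bounded by
    //    backoff.monitor.pending.limit, so they are processed next cycle.
    Iterator<DecomNode> it = outOfServiceBlocks.values().iterator();
    while (it.hasNext() && pendingRep.size() < pendingRepLimit) {
      DecomNode node = it.next();
      while (!node.blocks.isEmpty() && pendingRep.size() < pendingRepLimit) {
        pendingRep.add(node.blocks.remove(node.blocks.size() - 1));
      }
      if (node.blocks.isEmpty()) {
        it.remove(); // node fully replicated: stop tracking it
      }
    }
  }
}
```

Under these assumptions, the failure mode the new test exercises falls out directly: if all maxTrackedNodes slots in outOfServiceBlocks are held by unhealthy nodes whose blocks can never finish replicating, step 2 never admits another node, and healthy nodes wait in pendingNodes indefinitely.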
