[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144516#comment-14144516 ]
Hadoop QA commented on HDFS-7128:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12670602/HDFS-7128.patch
  against trunk revision 7b8df93.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
                  org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
                  org.apache.hadoop.hdfs.server.balancer.TestBalancer
                  org.apache.hadoop.hdfs.server.datanode.fsdataset.TestAvailableSpaceVolumeChoosingPolicy
                  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8160//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8160//console

This message is automatically generated.

> Decommission slows way down when it gets towards the end
> ---------------------------------------------------------
>
>                 Key: HDFS-7128
>                 URL: https://issues.apache.org/jira/browse/HDFS-7128
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7128.patch
>
>
> When we decommission nodes across different racks, the decommission process
> becomes really slow at the end, hardly making any progress. The problem is
> that some blocks have all 3 replicas on decomm-in-progress DNs, and the way
> replications are scheduled causes unnecessary delay. Here is the analysis.
> When BlockManager schedules replication work from neededReplications, it
> first needs to pick the source node for each replication via
> chooseSourceDatanode. The core policies for picking the source node are:
> 1. Prefer a decomm-in-progress node.
> 2. Only pick nodes whose outstanding replication counts are below the
> thresholds dfs.namenode.replication.max-streams or
> dfs.namenode.replication.max-streams-hard-limit, depending on the
> replication priority.
> When we decommission nodes:
> 1. All the decommissioning nodes' blocks are added to neededReplications.
> 2. BM picks X blocks from neededReplications in each iteration, where X is
> based on the cluster size and a configurable multiplier. So if the cluster
> has 2000 nodes, X will be around 4000.
> 3. Say these 4000 blocks all have replicas on the same decomm-in-progress
> node A. A ends up being chosen as the source node for all 4000 blocks. The
> reason the outstanding replication thresholds don't kick in lies in the
> implementation of BlockManager.computeReplicationWorkForBlocks:
> node.getNumberOfBlocksToBeReplicated() remains zero during source selection
> because node.addBlockToBeReplicated is only called after the source-node
> iteration, as the excerpt below shows.
> {noformat}
> ...
> synchronized (neededReplications) {
>   for (int priority = 0; priority < blocksToReplicate.size(); priority++) {
>     ...
>     chooseSourceDatanode
>     ...
>   }
> }
> ...
> for (ReplicationWork rw : work) {
>   ...
>   rw.srcNode.addBlockToBeReplicated(block, targets);
>   ...
> }
> {noformat}
>
> 4. So several decomm-in-progress nodes A, B, and C each end up with around
> 4000 in node.getNumberOfBlocksToBeReplicated().
> 5. If we assume each node can replicate 5 blocks per minute, it will take
> 800 minutes to finish replicating these blocks.
> 6. The pending replication timeout kicks in after 5 minutes. The items are
> removed from the pending replication queue and added back to
> neededReplications, and the replications are then handled by other source
> nodes of these blocks. But the blocks still remain in A, B, and C's per-node
> replication queues (DatanodeDescriptor.replicateBlocks), so A, B, and C keep
> replicating these blocks, even though other DNs may already have replicated
> them after the timeout.
> 7. Suppose some block's replicas exist only on A, B, and C, and the block
> sits at the end of A's replication queue. Even after the block's replication
> times out, no source node can be chosen, because A, B, and C all have high
> pending replication counts. So we have to wait until A drains its queue,
> even though the items in it have already been taken care of by other nodes
> and are no longer under-replicated.
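>
> The sketch below is a minimal, self-contained illustration of that ordering
> problem, not the actual BlockManager code; the Node and IterationSketch
> classes and the MAX_STREAMS constant are made-up stand-ins for
> DatanodeDescriptor and dfs.namenode.replication.max-streams. Because the
> counter read by the threshold check is only incremented after every source
> has been chosen, the check passes for every block in the iteration:
> {noformat}
> // Simplified stand-in for DatanodeDescriptor: tracks scheduled replications.
> class Node {
>   private int blocksToBeReplicated = 0;
>   int getNumberOfBlocksToBeReplicated() { return blocksToBeReplicated; }
>   void addBlockToBeReplicated() { blocksToBeReplicated++; }
> }
>
> public class IterationSketch {
>   // Stand-in for dfs.namenode.replication.max-streams.
>   static final int MAX_STREAMS = 2;
>
>   public static void main(String[] args) {
>     Node a = new Node();  // decomm-in-progress node A
>     java.util.List<Node> chosen = new java.util.ArrayList<>();
>
>     // Phase 1: pick a source for each of the 4000 blocks. A passes the
>     // threshold check every time because its counter is still zero.
>     for (int block = 0; block < 4000; block++) {
>       if (a.getNumberOfBlocksToBeReplicated() < MAX_STREAMS) {
>         chosen.add(a);
>       }
>     }
>
>     // Phase 2: the counter is only bumped after all sources are chosen.
>     for (Node src : chosen) {
>       src.addBlockToBeReplicated();
>     }
>
>     // Prints 4000: A was scheduled for every block despite MAX_STREAMS = 2.
>     System.out.println(a.getNumberOfBlocksToBeReplicated());
>   }
> }
> {noformat}
> Checking and incrementing the counter in the same pass (i.e. bumping it
> during source selection rather than afterwards) would let the max-streams
> thresholds take effect within an iteration.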