[ 
https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov reassigned HDFS-7128:
-----------------------------------

    Assignee: Gera Shegalov

> Decommission slows way down when it gets towards the end
> --------------------------------------------------------
>
>                 Key: HDFS-7128
>                 URL: https://issues.apache.org/jira/browse/HDFS-7128
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Gera Shegalov
>
> When we decommission nodes across different racks, the decommission process 
> becomes really slow at the end, hardly making any progress. The problem is 
> that some blocks are on 3 decomm-in-progress DNs, and the way replications 
> are scheduled causes unnecessary delay. Here is the analysis.
> When BlockManager schedules the replication work from neededReplication, it 
> first needs to pick the source node for replication via chooseSourceDatanode. 
> The core policies to pick the source node are:
> 1. Prefer decomm-in-progress node.
> 2. Only pick the nodes whose outstanding replication counts are below 
> thresholds dfs.namenode.replication.max-streams or 
> dfs.namenode.replication.max-streams-hard-limit, based on the replication 
> priority.
> When we decommission nodes,
> 1. All the decommission nodes' blocks will be added to neededReplication.
> 2. BM will pick X number of blocks from neededReplication in each iteration. 
> X is based on cluster size and some configurable multiplier. So if the 
> cluster has 2000 nodes, X will be around 4000.
> 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends 
> up being chosen as the source node for all 4000 blocks. The reason the 
> outstanding replication thresholds don't kick in is the implementation of 
> BlockManager.computeReplicationWorkForBlocks; 
> node.getNumberOfBlocksToBeReplicated() remains zero, given that 
> node.addBlockToBeReplicated is called only after the source node iteration.
> {noformat}
> ...
>       synchronized (neededReplications) {
>         for (int priority = 0; priority < blocksToReplicate.size(); priority++) {
> ...
> chooseSourceDatanode
> ...
>         }
>       for(ReplicationWork rw : work){
> ...
>           rw.srcNode.addBlockToBeReplicated(block, targets);
> ...
>       }
> {noformat}
>  
> 4. So several decomm-in-progress nodes A, B, C each end up with a 
> node.getNumberOfBlocksToBeReplicated() of 4000.
> 5. If we assume each node can replicate 5 blocks per minute, it is going to 
> take 4000 / 5 = 800 minutes to finish replication of these blocks.
> 6. Pending replication timeout kicks in after 5 minutes. The items will be 
> removed from the pending replication queue and added back to 
> neededReplication. The replications will then be handled by other source 
> nodes of these blocks. But the blocks still remain in nodes A, B, C's pending 
> replication queue, DatanodeDescriptor.replicateBlocks, so A, B, C continue 
> the replications of these blocks, although these blocks might have been 
> replicated by other DNs after replication timeout.
> 7. Some block's replicas exist on A, B, C and it is at the end of A's pending 
> replication queue. Even though the block's replication has timed out, no 
> source node can be chosen given A, B, C all have high pending replication 
> counts. So we have to wait until A drains its pending replication queue. 
> Meanwhile, the items in A's pending replication queue have been taken care 
> of by other nodes and are no longer under-replicated.
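
The effect described in steps 3 to 5 can be reproduced with a small standalone sketch. It does not use any Hadoop classes: Node, chooseSource, schedule and minutesToDrain are hypothetical names invented for illustration, the limit of 2 outstanding replications is an assumed value, and source selection is simplified to "first eligible node" (the real chooseSourceDatanode also considers priority and node state). The only point it tries to show is that a threshold checked against a counter cannot take effect if the counter is updated after all sources have been chosen.

```java
import java.util.ArrayList;
import java.util.List;

class DeferredCountSketch {
    /** Hypothetical stand-in for a datanode; not the real DatanodeDescriptor. */
    static final class Node {
        final String name;
        int queued; // plays the role of getNumberOfBlocksToBeReplicated()
        Node(String name) { this.name = name; }
    }

    /** Returns the first node whose queued count is below the limit, or null. */
    static Node chooseSource(List<Node> replicas, int maxStreams) {
        for (Node n : replicas) {
            if (n.queued < maxStreams) {
                return n;
            }
        }
        return null;
    }

    /**
     * Schedules `blocks` blocks whose replicas all live on the same
     * `nodeCount` nodes and returns the longest resulting per-node queue.
     * With countEagerly == false the counters are bumped only after every
     * source has been chosen, mirroring the two loops in the issue text.
     */
    static int schedule(int blocks, int nodeCount, int maxStreams, boolean countEagerly) {
        List<Node> replicas = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            replicas.add(new Node("dn" + i));
        }
        List<Node> chosen = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            Node src = chooseSource(replicas, maxStreams);
            if (src == null) {
                continue; // no eligible source; the block would wait for a later iteration
            }
            if (countEagerly) {
                src.queued++;    // threshold sees the new work immediately
            } else {
                chosen.add(src); // counter update deferred to the second loop
            }
        }
        for (Node src : chosen) {
            src.queued++;
        }
        int max = 0;
        for (Node n : replicas) {
            max = Math.max(max, n.queued);
        }
        return max;
    }

    /** Minutes needed to drain a queue at a fixed rate, rounded up. */
    static int minutesToDrain(int queued, int blocksPerMinute) {
        return (queued + blocksPerMinute - 1) / blocksPerMinute;
    }

    public static void main(String[] args) {
        int deferred = schedule(4000, 3, 2, false);
        int eager = schedule(4000, 3, 2, true);
        System.out.println("deferred counting, longest queue: " + deferred);
        System.out.println("eager counting, longest queue:    " + eager);
        System.out.println("minutes to drain deferred queue:  " + minutesToDrain(deferred, 5));
    }
}
```

In this simplified model the deferred variant piles all 4000 blocks onto one node, while the eager variant stops at the assumed limit on each node and leaves the remaining blocks for a later iteration. Whether updating the counter inside the selection loop is the right fix for BlockManager is a separate question that the sketch does not answer.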



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
