[
https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gera Shegalov reassigned HDFS-7128:
-----------------------------------
Assignee: Gera Shegalov
> Decommission slows way down when it gets towards the end
> --------------------------------------------------------
>
> Key: HDFS-7128
> URL: https://issues.apache.org/jira/browse/HDFS-7128
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
> Assignee: Gera Shegalov
>
> When we decommission nodes across different racks, the decommission process
> becomes really slow toward the end, hardly making any progress. The problem is
> that some blocks have all their replicas on 3 decomm-in-progress DNs, and the
> way replications are scheduled causes unnecessary delay. Here is the analysis.
> When BlockManager schedules the replication work from neededReplication, it
> first needs to pick the source node for replication via chooseSourceDatanode.
> The core policies to pick the source node are:
> 1. Prefer a decomm-in-progress node.
> 2. Only pick nodes whose outstanding replication counts are below the
> thresholds dfs.namenode.replication.max-streams or
> dfs.namenode.replication.max-streams-hard-limit, depending on the replication
> priority.
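As a rough sketch, the selection policy amounts to the following (class, field, and method names here are illustrative stand-ins, not the actual BlockManager/chooseSourceDatanode API; the threshold values are the defaults):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the source-selection policy described above;
// names are stand-ins, not the real BlockManager API.
public class SourcePolicySketch {
    static final int MAX_STREAMS = 2;      // dfs.namenode.replication.max-streams
    static final int MAX_STREAMS_HARD = 4; // dfs.namenode.replication.max-streams-hard-limit

    static class Node {
        final boolean decommInProgress;
        final int outstandingReplications;
        Node(boolean decomm, int outstanding) {
            this.decommInProgress = decomm;
            this.outstandingReplications = outstanding;
        }
    }

    // Prefer a decomm-in-progress replica; skip any node whose outstanding
    // replication count is at or above the limit for this priority.
    static Node chooseSource(List<Node> replicas, boolean highestPriority) {
        int limit = highestPriority ? MAX_STREAMS_HARD : MAX_STREAMS;
        Node fallback = null;
        for (Node n : replicas) {
            if (n.outstandingReplications >= limit) continue;
            if (n.decommInProgress) return n; // preferred source
            if (fallback == null) fallback = n;
        }
        return fallback;
    }

    public static void main(String[] args) {
        Node busy = new Node(false, 3);  // over the soft limit, skipped
        Node decomm = new Node(true, 0); // preferred
        System.out.println(chooseSource(Arrays.asList(busy, decomm), false) == decomm);
    }
}
```

Note that the hard limit only widens the cutoff for the highest-priority blocks; a node over the applicable limit is never picked at all.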
> When we decommission nodes,
> 1. All of the decommissioning nodes' blocks are added to neededReplication.
> 2. BM picks X blocks from neededReplication in each iteration. X is based on
> the cluster size and a configurable multiplier. So if the cluster has 2000
> nodes, X will be around 4000.
> 3. Given these 4000 blocks are all on the same decomm-in-progress node A, A
> ends up being chosen as the source node for all 4000 blocks. The reason the
> outstanding replication thresholds don't kick in is the implementation of
> BlockManager.computeReplicationWorkForBlocks:
> node.getNumberOfBlocksToBeReplicated() remains zero because
> node.addBlockToBeReplicated is only called after the source-node iteration.
> {noformat}
> ...
> synchronized (neededReplications) {
>   for (int priority = 0; priority < blocksToReplicate.size();
>        priority++) {
>     ...
>     chooseSourceDatanode
>     ...
>   }
> }
> for (ReplicationWork rw : work) {
>   ...
>   rw.srcNode.addBlockToBeReplicated(block, targets);
>   ...
> }
> {noformat}
>
> 4. So several decomm-in-progress nodes A, B, and C each end up with a
> node.getNumberOfBlocksToBeReplicated() of around 4000.
> 5. If we assume each node can replicate 5 blocks per minute, it will take 800
> minutes to finish replicating these blocks.
> 6. The pending replication timeout kicks in after 5 minutes. The items are
> removed from the pending replication queue and added back to
> neededReplication, and the replications are then handled by other source
> nodes of these blocks. But the blocks still remain in A, B, and C's pending
> replication queues, DatanodeDescriptor.replicateBlocks, so A, B, and C
> continue replicating these blocks, even though the blocks might already have
> been replicated by other DNs after the replication timeout.
> 7. Some blocks' replicas exist only on A, B, and C, and such a block sits at
> the end of A's pending replication queue. Even after the block's replication
> times out, no source node can be chosen, since A, B, and C all have high
> pending replication counts. So we have to wait until A drains its pending
> replication queue, even though the items in that queue have meanwhile been
> taken care of by other nodes and are no longer under-replicated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)