[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920396#comment-16920396
]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------
The way I believe the redundancyMonitor works is as follows:
It picks the next live_nodes * work_multiplier (default 2) blocks from the
needs_replication queue in the order they were added to the queue.
Then, for each block, it looks at the nodes hosting a replica and randomly
picks one that does not already have its limit of replication streams
allocated to act as the source. I see the comment in chooseSourceDatanodes()
stating that it prefers decommissioning nodes, and I think it implements
this by capping IN_SERVICE nodes at maxReplicationStreams but
decommissioning nodes at maxReplicationStreamsHardLimit, i.e. the
decommissioning nodes normally have a higher limit.
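To make that concrete, here is a toy model of how I read the source
selection. The class and field names are mine, not the actual
BlockManager / chooseSourceDatanodes() code:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy model: a node is a valid source while its allocated replication
// streams are below its limit, and decommissioning nodes get the
// higher hard limit. All names here are illustrative only.
class Node {
  boolean decommissioning;
  int allocatedStreams; // replication work already queued on this node
}

class SourcePicker {
  static final int MAX_STREAMS = 5;  // maxReplicationStreams
  static final int HARD_LIMIT = 10;  // maxReplicationStreamsHardLimit
  static final Random RAND = new Random();

  // Returns a random eligible source for a block, or null if every
  // replica holder is already at its limit (the block is skipped).
  static Node pickSource(List<Node> replicaHolders) {
    List<Node> candidates = new ArrayList<>();
    for (Node n : replicaHolders) {
      int limit = n.decommissioning ? HARD_LIMIT : MAX_STREAMS;
      if (n.allocatedStreams < limit) {
        candidates.add(n);
      }
    }
    return candidates.isEmpty()
        ? null
        : candidates.get(RAND.nextInt(candidates.size()));
  }
}
{code}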
Imagine we have maxReplicationStreams of 5 (a very low setting - many
clusters will have this set to 50 or more), maxReplicationStreamsHardLimit
of 10, and 200 live nodes.
This means the redundancy monitor will pick 200 * 2 (work multiplier default) =
400 blocks to process on each iteration.
For each block it will then randomly select one of the hosting datanodes as
the source, meaning the decommissioning node will get allocated 10 (the
hard limit) out of the first 30 blocks on average. At that point it will
have reached its hard limit, and the remaining 370 blocks should be
assigned to other nodes (assuming they have capacity).
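A quick simulation of that arithmetic, under my assumptions (replication
factor 3, one replica of each block on the decommissioning node, and the
other replica holders always having spare capacity):

{code:java}
import java.util.Random;

// Simulates one redundancy monitor iteration over 400 blocks, each with
// 3 replicas, one of which is on the decommissioning node. A source is
// picked at random from the eligible replicas; the decommissioning node
// stops being eligible once it reaches the hard limit of 10.
public class DecomSim {
  public static void main(String[] args) {
    Random rand = new Random();
    int decomStreams = 0;
    for (int block = 1; block <= 400; block++) {
      boolean decomEligible = decomStreams < 10;
      int eligibleReplicas = decomEligible ? 3 : 2;
      if (decomEligible && rand.nextInt(eligibleReplicas) == 0) {
        decomStreams++;
        if (decomStreams == 10) {
          System.out.println("Hard limit reached after " + block + " blocks");
        }
      }
    }
    // Typically prints a value around 30; the remaining ~370 blocks
    // are sourced from the other replicas.
  }
}
{code}

Running it typically prints a value around 30, matching the estimate above.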
Therefore, for replication factor 3 blocks, the decommissioning node will
likely replicate far fewer than a third of its own blocks itself, but the
blocks it does replicate will likely all come from one disk.
My reading of the logic therefore also suggests that if you decommission
several nodes, it will take the redundancy monitor some time to reach the
blocks from the second node, as it works through the list of blocks in
order. However, there is a good chance those other decommissioning nodes
will be participating in replicating blocks from the first node, so they
are unlikely to be idle.
However, the scenario [~zhangchen] mentioned is an interesting one, where
blocks have replication factor 1, so the decommissioning node must be the
source. In that case I think the redundancy monitor would process 400
blocks, assign 10 of them, and skip the next 390 on each iteration until it
reaches the blocks of the second node being decommissioned, repeating this
until it cycles back to the start of the under-replicated list again. So if
you are decommissioning nodes with replication factor 1 blocks, not only
would it use only one disk, but it would only work on one decommissioning
node at a time. I have not tested this, so there may be some logic I have
not fully understood that handles this sort of case.
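To illustrate, the replication factor 1 case as a toy loop (again my
reading, not the real redundancy monitor code):

{code:java}
// Toy loop for the replication factor 1 case: the decommissioning node
// is the only possible source for every block, so once its 10
// hard-limit slots are taken, the rest of the batch is skipped and
// retried on a later iteration.
public class Rf1Sketch {
  public static void main(String[] args) {
    final int hardLimit = 10;
    int decomStreams = 0;
    int assigned = 0;
    int skipped = 0;
    for (int block = 0; block < 400; block++) {
      if (decomStreams < hardLimit) {
        decomStreams++; // the only replica is on the decommissioning node
        assigned++;
      } else {
        skipped++; // no eligible source this iteration
      }
    }
    System.out.println("assigned=" + assigned + ", skipped=" + skipped);
    // Prints: assigned=10, skipped=390
  }
}
{code}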
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam on a single volume until it is cleaned out, at which point they all
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.
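For illustration only, a minimal sketch of the randomization being
requested. This is hypothetical - the real DatanodeAdminManager scheduling
code does not operate on a simple List like this:

{code:java}
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the requested fix: shuffle the block list
// before scheduling so replication work is spread across all volumes
// rather than draining one volume at a time.
class ShuffleSketch {
  static <B> void scheduleReplications(List<B> blocksOnDecomNode) {
    Collections.shuffle(blocksOnDecomNode);
    for (B block : blocksOnDecomNode) {
      // schedule 'block' for replication elsewhere (omitted)
    }
  }
}
{code}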