[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920396#comment-16920396
]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------
The way I believe the redundancyMonitor works is as follows:
It picks the next live_nodes * work_multiplier (default 2) blocks from the
needs_replication queue in the order they were added to the queue.
Then, for each block, it looks at the nodes hosting a replica and randomly
picks one that does not already have its limit of replication streams
allocated to act as the source. I see the comment in chooseSourceDatanodes()
stating that it prefers decommissioning nodes, and I think it implements
this by capping IN_SERVICE nodes at maxReplicationStreams but
decommissioning nodes at maxReplicationStreamsHardLimit, i.e. the
decommissioning nodes normally have a higher limit.
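To make that concrete, here is a toy model of how I read the source
selection. The class and field names are mine, not the actual
BlockManager / chooseSourceDatanodes() code:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy model: a node is a valid source while its allocated replication
// streams are below its limit, and decommissioning nodes get the
// higher hard limit. All names here are illustrative only.
class Node {
  boolean decommissioning;
  int allocatedStreams; // replication work already queued on this node
}

class SourcePicker {
  static final int MAX_STREAMS = 5;  // maxReplicationStreams
  static final int HARD_LIMIT = 10;  // maxReplicationStreamsHardLimit
  static final Random RAND = new Random();

  // Returns a random eligible source for a block, or null if every
  // replica holder is already at its limit (the block is skipped).
  static Node pickSource(List<Node> replicaHolders) {
    List<Node> candidates = new ArrayList<>();
    for (Node n : replicaHolders) {
      int limit = n.decommissioning ? HARD_LIMIT : MAX_STREAMS;
      if (n.allocatedStreams < limit) {
        candidates.add(n);
      }
    }
    return candidates.isEmpty()
        ? null
        : candidates.get(RAND.nextInt(candidates.size()));
  }
}
{code}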
Imagine we have maxReplicationStreams of 5 (a very low setting - many
clusters will have this set to 50 or more), maxReplicationStreamsHardLimit
of 10, and 200 live nodes.
This means the redundancy monitor will pick 200 * 2 (work multiplier default) =
400 blocks to process on each iteration.
For each block it will then randomly select one of the hosting datanodes as
the source, meaning the decommissioning node will get allocated 10 (the
hard limit) out of the first 30 blocks on average. At that point it will
have reached its hard limit, and the remaining 370 blocks should be
assigned to other nodes (assuming they have capacity).
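A quick simulation of that arithmetic, under my assumptions (replication
factor 3, one replica of each block on the decommissioning node, and the
other replica holders always having spare capacity):

{code:java}
import java.util.Random;

// Simulates one redundancy monitor iteration over 400 blocks, each with
// 3 replicas, one of which is on the decommissioning node. A source is
// picked at random from the eligible replicas; the decommissioning node
// stops being eligible once it reaches the hard limit of 10.
public class DecomSim {
  public static void main(String[] args) {
    Random rand = new Random();
    int decomStreams = 0;
    for (int block = 1; block <= 400; block++) {
      boolean decomEligible = decomStreams < 10;
      int eligibleReplicas = decomEligible ? 3 : 2;
      if (decomEligible && rand.nextInt(eligibleReplicas) == 0) {
        decomStreams++;
        if (decomStreams == 10) {
          System.out.println("Hard limit reached after " + block + " blocks");
        }
      }
    }
    // Typically prints a value around 30; the remaining ~370 blocks
    // are sourced from the other replicas.
  }
}
{code}

Running it typically prints a value around 30, matching the estimate above.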
Therefore, for replication factor 3 blocks, the decommissioning node will
likely replicate far fewer than a third of its own blocks itself, but the
blocks it does replicate will likely all come from one disk.
My reading of the logic therefore also suggests that if you decommission
several nodes, it will take the redundancy monitor some time to reach the
blocks from the second node, as it works through the list of blocks in
order. However, there is a good chance those other decommissioning nodes
will be participating in replicating blocks from the first node, so they
are unlikely to be idle.
However, the scenario [~zhangchen] mentioned is an interesting one, where
blocks have replication factor 1, so the decommissioning node must be the
source. In that case I think the redundancy monitor would process 400
blocks, assign 10 of them, and skip the next 390 on each iteration until it
reaches the blocks of the second node being decommissioned, repeating this
until it cycles back to the start of the under-replicated list again. So if
you are decommissioning nodes with replication factor 1 blocks, not only
would it use only one disk, but it would only work on one decommissioning
node at a time. I have not tested this, so there may be some logic I have
not fully understood that handles this sort of case.
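To illustrate, the replication factor 1 case as a toy loop (again my
reading, not the real redundancy monitor code):

{code:java}
// Toy loop for the replication factor 1 case: the decommissioning node
// is the only possible source for every block, so once its 10
// hard-limit slots are taken, the rest of the batch is skipped and
// retried on a later iteration.
public class Rf1Sketch {
  public static void main(String[] args) {
    final int hardLimit = 10;
    int decomStreams = 0;
    int assigned = 0;
    int skipped = 0;
    for (int block = 0; block < 400; block++) {
      if (decomStreams < hardLimit) {
        decomStreams++; // the only replica is on the decommissioning node
        assigned++;
      } else {
        skipped++; // no eligible source this iteration
      }
    }
    System.out.println("assigned=" + assigned + ", skipped=" + skipped);
    // Prints: assigned=10, skipped=390
  }
}
{code}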
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam on a single volume until it is cleaned out, at which point they all
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.
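For illustration only, a minimal sketch of the randomization being
requested. This is hypothetical - the real DatanodeAdminManager scheduling
code does not operate on a simple List like this:

{code:java}
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the requested fix: shuffle the block list
// before scheduling so replication work is spread across all volumes
// rather than draining one volume at a time.
class ShuffleSketch {
  static <B> void scheduleReplications(List<B> blocksOnDecomNode) {
    Collections.shuffle(blocksOnDecomNode);
    for (B block : blocksOnDecomNode) {
      // schedule 'block' for replication elsewhere (omitted)
    }
  }
}
{code}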