[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928431#comment-16928431
]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------
I've been thinking about this issue on and off over the last few days and I
realised that most of the problems come from dumping too many blocks onto a
queue all at once. In this issue we have uncovered the following problems:
# Blocks are placed on the queue disk by disk for a datanode, and the queue is
basically a FIFO, limiting the decommission speed of that node to the speed of a
single disk (see the sketch after this list). It is hard to break away from that
order, as our minimum unit of work under a lock is probably a single DN storage.
# We also process each DN one at a time, and an operator will often decommission
one node followed by another shortly after, so the first node to be
decommissioned tends to decommission faster than the others, and the overall
decommission does not use all the cluster resources it could.
# I suspect that sometimes blocks get skipped when processing the replication
queue if there are no replication streams available, and it can take a very
long time to cycle back around to them when there are millions of blocks to
process. This could result in some nodes failing to complete decommission due
to only a few blocks that were skipped previously.
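To make the first problem concrete, here is a minimal sketch (hypothetical
names, not the actual DatanodeAdminManager code) of how scanning a node one
storage at a time and appending to a FIFO leaves the queue grouped by disk:
{code:java}
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class DiskOrderedScanSketch {
  // Hypothetical stand-in: storage (disk) id -> the block ids held on that disk.
  static Queue<Long> scanNode(Map<String, List<Long>> blocksByStorage) {
    Queue<Long> replicationQueue = new ArrayDeque<>();
    for (List<Long> storageBlocks : blocksByStorage.values()) { // one disk at a time
      for (Long block : storageBlocks) {
        replicationQueue.add(block); // all of this disk's blocks land together
      }
    }
    // FIFO consumption now replicates disk 1's blocks, then disk 2's, and so on,
    // so the node decommissions at roughly the speed of a single disk.
    return replicationQueue;
  }
}
{code}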
It is also worth noting that while the under-replicated blocks are placed onto
the replication queue, the list of them is also retained in the decommission
monitor, and they are all checked periodically to see if they have replicated
yet. This means the pending blocks will be checked many times over the hours or
days the decommission is running, which is also a waste of resources.
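As a rough illustration (hypothetical names, not the real monitor code), that
recheck amounts to something like the loop below, whose cost is proportional to
every tracked block on every tick:
{code:java}
import java.util.List;
import java.util.Map;

class RescanCostSketch {
  interface BlockCheck {
    boolean isSufficientlyReplicated(long blockId);
  }

  // Runs every monitor interval. Each tick walks every block still tracked for
  // every decommissioning node, for the hours or days the decommission takes,
  // even though most of those blocks have not yet reached the replication queue.
  static void tick(Map<String, List<Long>> trackedBlocksPerNode, BlockCheck check) {
    for (List<Long> blocks : trackedBlocksPerNode.values()) {
      blocks.removeIf(check::isSufficientlyReplicated);
    }
  }
}
{code}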
I think we could change the logic (a rough sketch follows the list below), so
that when one or more nodes are decommissioned, we:
# Process the blocks on the node and store them in the decommission monitor,
but do not yet add them to the replication queue.
# In the decommission monitor, we should store the blocks in a structure that
allows them to be retrieved efficiently in a random order. Perhaps a hashmap,
but there may be something better.
# This means that the blocks from each disk and each node will be naturally in
this structure in a random order.
# Every X seconds, check the size of the pending replication queue. If it is
under some threshold, take Y blocks from the 'waiting to decommission block
list', check if they are under-replicated, and if so, add them to the
replication queue and store them in a new pending list in the decommission
monitor.
# Periodically check only the blocks in the pending list to see if they have
been replicated, and whenever the replication queue is back under the
threshold, repeat the process.
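Here is a rough sketch of that flow, assuming a hash-based set gives us a
"random enough" iteration order; all class and method names here are
hypothetical, not existing HDFS code:
{code:java}
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class ThrottledDecommissionSketch {
  interface ReplicationQueue {
    int size();
    void add(long blockId);
  }
  interface BlockCheck {
    boolean isUnderReplicated(long blockId);
  }

  // Steps 1-3: blocks found when the node starts decommissioning, not yet queued.
  // A HashSet iterates in a scattered order, mixing disks and nodes naturally.
  private final Set<Long> waiting = new HashSet<>();
  // Step 4: blocks already handed to the replication queue.
  private final Set<Long> pending = new HashSet<>();

  void addDecommissioningNodeBlocks(List<Long> blockIds) {
    waiting.addAll(blockIds); // do not touch the replication queue yet
  }

  // Steps 4-5: called every X seconds by the decommission monitor.
  void tick(ReplicationQueue queue, BlockCheck check, int queueThreshold, int batchSizeY) {
    // Step 5: only the pending blocks are rechecked, not everything we track.
    pending.removeIf(b -> !check.isUnderReplicated(b));

    if (queue.size() >= queueThreshold) {
      return; // the queue is busy (e.g. a failed node); let it drain first
    }
    Iterator<Long> it = waiting.iterator();
    int released = 0;
    while (it.hasNext() && released < batchSizeY) {
      long block = it.next();
      it.remove();
      if (check.isUnderReplicated(block)) { // skip blocks that no longer need work
        queue.add(block);
        pending.add(block);
        released++;
      }
    }
  }
}
{code}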
This way we avoid flooding the replication queue with millions of blocks and
release them onto it in a more controlled fashion.
The blocks should be in a random order too.
If a node failure happens, the replication queue will fill up with its blocks
and decommission will temporarily stall, giving priority to recovering the
blocks from the failed node, which is probably desirable.
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam a single volume until it is cleaned out, at which point they all slam
> the next volume, and so on.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.
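A minimal sketch of the randomization being requested (hypothetical names, not
the attached patch): collect the node's blocks from every volume first, then
shuffle before scheduling, so the transfer threads pull from all disks at once
rather than draining one volume at a time.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class RandomizedScheduleSketch {
  static List<Long> blocksToSchedule(Map<String, List<Long>> blocksByVolume) {
    List<Long> all = new ArrayList<>();
    for (List<Long> volumeBlocks : blocksByVolume.values()) {
      all.addAll(volumeBlocks); // gather every volume's blocks before scheduling
    }
    Collections.shuffle(all);   // interleave blocks from different volumes
    return all;                 // replicate in this shuffled order
  }
}
{code}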