[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928431#comment-16928431
]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------
I've been thinking about this issue on and off over the last few days and I
realised that most of the problems come from dumping too many blocks onto a
queue all at once. In this issue we have uncovered the following problems:
# Blocks are placed on the queue disk by disk for a datanode, and the queue is
basically a FIFO, limiting the decommission speed of that node to the speed of a
single disk (see the sketch after this list). It is hard to break away from that
order, as our minimum unit of work under a lock is probably a single DN storage.
# We also process each DN one at a time, and an operator will often decommission
one node followed by another shortly after, so the first node to be
decommissioned tends to decommission faster than the others, and the overall
decommission does not use all the cluster resources it could.
# I suspect that sometimes blocks get skipped when processing the replication
queue if there are no replication streams available, and it can take a very
long time to cycle back around to them when there are millions of blocks to
process. This could result in some nodes failing to complete decommission due
to only a few blocks that were skipped previously.
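To make the first problem concrete, here is a minimal sketch (hypothetical
names, not the actual DatanodeAdminManager code) of how scanning a node one
storage at a time and appending to a FIFO leaves the queue grouped by disk:
{code:java}
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class DiskOrderedScanSketch {
  // Hypothetical stand-in: storage (disk) id -> the block ids held on that disk.
  static Queue<Long> scanNode(Map<String, List<Long>> blocksByStorage) {
    Queue<Long> replicationQueue = new ArrayDeque<>();
    for (List<Long> storageBlocks : blocksByStorage.values()) { // one disk at a time
      for (Long block : storageBlocks) {
        replicationQueue.add(block); // all of this disk's blocks land together
      }
    }
    // FIFO consumption now replicates disk 1's blocks, then disk 2's, and so on,
    // so the node decommissions at roughly the speed of a single disk.
    return replicationQueue;
  }
}
{code}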
It is also worth noting that while the under-replicated blocks are placed onto
the replication queue, the list of them is also retained in the decommission
monitor, and they are all checked periodically to see if they have replicated
yet. This means the pending blocks will be checked many times over the hours or
days the decommission is running, which is also a waste of resources.
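As a rough illustration (hypothetical names, not the real monitor code), that
recheck amounts to something like the loop below, whose cost is proportional to
every tracked block on every tick:
{code:java}
import java.util.List;
import java.util.Map;

class RescanCostSketch {
  interface BlockCheck {
    boolean isSufficientlyReplicated(long blockId);
  }

  // Runs every monitor interval. Each tick walks every block still tracked for
  // every decommissioning node, for the hours or days the decommission takes,
  // even though most of those blocks have not yet reached the replication queue.
  static void tick(Map<String, List<Long>> trackedBlocksPerNode, BlockCheck check) {
    for (List<Long> blocks : trackedBlocksPerNode.values()) {
      blocks.removeIf(check::isSufficientlyReplicated);
    }
  }
}
{code}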
I think we could change the logic (a rough sketch follows the list below), so
that when one or more nodes are decommissioned, we:
# Process the blocks on the node and store them in the decommission monitor,
but do not yet add them to the replication queue.
# In the decommission monitor, we should store the blocks in a structure that
allows them to be retrieved efficiently in a random order. Perhaps a hashmap,
but there may be something better.
# This means that the blocks from each disk and each node will be naturally in
this structure in a random order.
# Every X seconds, check the size of the pending replication queue. If it is
under some threshold, take Y blocks from the 'waiting to decommission block
list', check if they are under-replicated, and if so, add them to the
replication queue and store them in a new pending list in the decommission
monitor.
# Periodically check only the blocks in the pending list to see if they have
been replicated, and whenever the replication queue is back under the
threshold, repeat the process.
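Here is a rough sketch of that flow, assuming a hash-based set gives us a
"random enough" iteration order; all class and method names here are
hypothetical, not existing HDFS code:
{code:java}
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class ThrottledDecommissionSketch {
  interface ReplicationQueue {
    int size();
    void add(long blockId);
  }
  interface BlockCheck {
    boolean isUnderReplicated(long blockId);
  }

  // Steps 1-3: blocks found when the node starts decommissioning, not yet queued.
  // A HashSet iterates in a scattered order, mixing disks and nodes naturally.
  private final Set<Long> waiting = new HashSet<>();
  // Step 4: blocks already handed to the replication queue.
  private final Set<Long> pending = new HashSet<>();

  void addDecommissioningNodeBlocks(List<Long> blockIds) {
    waiting.addAll(blockIds); // do not touch the replication queue yet
  }

  // Steps 4-5: called every X seconds by the decommission monitor.
  void tick(ReplicationQueue queue, BlockCheck check, int queueThreshold, int batchSizeY) {
    // Step 5: only the pending blocks are rechecked, not everything we track.
    pending.removeIf(b -> !check.isUnderReplicated(b));

    if (queue.size() >= queueThreshold) {
      return; // the queue is busy (e.g. a failed node); let it drain first
    }
    Iterator<Long> it = waiting.iterator();
    int released = 0;
    while (it.hasNext() && released < batchSizeY) {
      long block = it.next();
      it.remove();
      if (check.isUnderReplicated(block)) { // skip blocks that no longer need work
        queue.add(block);
        pending.add(block);
        released++;
      }
    }
  }
}
{code}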
This way we avoid flooding the replication queue with millions of blocks and
release them onto it in a more controlled fashion.
The blocks should be in a random order too.
If a node failure happens, the replication queue will fill up with its blocks
and decommission will temporarily stall, giving priority to recovering the
blocks from the failed node, which is probably desirable.
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam a single volume until it is cleaned out, at which point they all slam
> the next volume, and so on.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.
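A minimal sketch of the randomization being requested (hypothetical names, not
the attached patch): collect the node's blocks from every volume first, then
shuffle before scheduling, so the transfer threads pull from all disks at once
rather than draining one volume at a time.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class RandomizedScheduleSketch {
  static List<Long> blocksToSchedule(Map<String, List<Long>> blocksByVolume) {
    List<Long> all = new ArrayList<>();
    for (List<Long> volumeBlocks : blocksByVolume.values()) {
      all.addAll(volumeBlocks); // gather every volume's blocks before scheduling
    }
    Collections.shuffle(all);   // interleave blocks from different volumes
    return all;                 // replicate in this shuffled order
  }
}
{code}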