[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921211#comment-16921211
]
Chen Zhang commented on HDFS-13157:
-----------------------------------
Thanks [~sodonnell] for your detailed analysis. Since the comments are very
long, I'll try to outline the conclusions we have reached so far:
# When decommissioning a DN, its blocks are added to the replication queue one
disk after another.
# When scheduling replication work, the redundancyMonitor takes blocks from the
highest-priority queue one by one, so for each DN the replication work
concentrates on one disk at a time (the sketch after this list illustrates
points 1 and 2).
# In some cases, decommissioning may make progress on only one DN (e.g. the
first liveNode*2 blocks are all from one DN, and those blocks have no overlap
with the other decommissioning nodes).
## *My thoughts here*: actually, in most cases all DNs will be decommissioning
at the same time, but the first node will finish much faster, which is also not
what we want.
# [~sodonnell] observed that the NN lock may be held for a very long time while
processing a node for decommission.
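To make points 1 and 2 concrete, here is a minimal, self-contained sketch of
the behaviour. The class and method names are made up for illustration and this
is not the actual DatanodeAdminManager code: blocks are enqueued storage by
storage, and the monitor then drains the queue in FIFO order, so all of the
early replication work reads from the first disk.
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class SequentialDecommissionModel {

  /** Blocks of the decommissioning DN, grouped per storage (disk). */
  static List<List<String>> blocksPerStorage() {
    List<List<String>> storages = new ArrayList<>();
    for (int disk = 0; disk < 3; disk++) {
      List<String> blocks = new ArrayList<>();
      for (int b = 0; b < 4; b++) {
        blocks.add("disk" + disk + "-blk" + b);
      }
      storages.add(blocks);
    }
    return storages;
  }

  public static void main(String[] args) {
    // Point 1: blocks are added to the replication queue one disk after another.
    Queue<String> replicationQueue = new ArrayDeque<>();
    for (List<String> storage : blocksPerStorage()) {
      replicationQueue.addAll(storage);
    }
    // Point 2: the monitor drains the queue in FIFO order, so all of the early
    // replication work hits disk0, then disk1, and so on.
    while (!replicationQueue.isEmpty()) {
      System.out.println("schedule replication for " + replicationQueue.poll());
    }
  }
}
{code}
The output schedules every disk0 block before any disk1 block, which matches
the single-volume hotspot described in this issue.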
We have encountered all of these problems on our production cluster, so I'd
like to share our solutions here:
# Add a new replication queue implementation that randomizes the order in which
blocks are taken from it.
# Add a configuration that makes the NN release the lock every 10000
(configurable) blocks. A rough sketch of both ideas follows this list.
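For reference, here is a rough sketch of what the two changes could look like.
All names (the class, the lock field, the 10000-block knob) are hypothetical
and this is not our actual patch; it only shows the shape of the idea.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RandomizedReplicationSketch {

  private final ReentrantReadWriteLock nnLock = new ReentrantReadWriteLock();
  // Hypothetical knob; in the real patch this would come from a config key.
  private final int blocksPerLockRelease = 10000;

  /** Solution 1: hand out the pending blocks in a randomized order. */
  List<String> randomizedOrder(List<String> pendingBlocks) {
    List<String> shuffled = new ArrayList<>(pendingBlocks);
    Collections.shuffle(shuffled); // spreads the work across disks and DNs
    return shuffled;
  }

  /** Solution 2: release and re-acquire the NN lock every N blocks. */
  void processForDecommission(List<String> blocks) {
    nnLock.writeLock().lock();
    try {
      int processed = 0;
      for (String block : blocks) {
        scheduleReplication(block);
        if (++processed % blocksPerLockRelease == 0) {
          // Drop the lock briefly so other NN operations can make progress.
          nnLock.writeLock().unlock();
          nnLock.writeLock().lock();
        }
      }
    } finally {
      nnLock.writeLock().unlock();
    }
  }

  private void scheduleReplication(String block) {
    // Placeholder for adding the block to the low-redundancy queues.
  }
}
{code}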
*I think randomizing the replication queue is better than randomizing the
blockIterator,* because randomizing the blockIterator won't resolve issue 3
mentioned above and makes the optimization for issue 4 harder.
*But randomizing the replication queue has a downside that we should consider
here*: normally the blocks at the head of the queue should be scheduled before
the blocks at the tail, because the longer a block waits for replication, the
higher its probability of data loss, and randomizing the order breaks that
guarantee.
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam on a single volume until it is cleaned out, at which point they all
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.