[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925523#comment-16925523
]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------
> How is it handled, iterating through each DataNode, that a block is scheduled
> to be replicated onto a DataNode that will be decommissioned further down in
> the list?
The block manager takes care of this when allocating a new block target. Nodes
that are in a decommissioning state will not be considered as a new target.
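That exclusion can be pictured with a minimal sketch (the names here are mine, not the actual BlockManager/DatanodeDescriptor code; the real logic lives in the block placement policy):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TargetFilterSketch {
    // Hypothetical admin states, mirroring the states the Namenode tracks per node.
    enum AdminState { NORMAL, DECOMMISSIONING, DECOMMISSIONED }

    record Node(String name, AdminState state) {}

    // Only nodes in normal service are eligible as new replica targets, so a
    // block can never be re-replicated onto a node that is itself scheduled
    // for decommission further down the list.
    static List<Node> eligibleTargets(List<Node> all) {
        return all.stream()
                  .filter(n -> n.state() == AdminState.NORMAL)
                  .collect(Collectors.toList());
    }
}
```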
In the current decommissioning implementation, the nodes selected for
decommissioning are processed very conservatively: for nodes with more than
500K blocks (dfs.namenode.decommission.blocks.per.interval), it will process
one node and then sleep for 30 seconds before processing the next one. I
believe this is to avoid holding the Namenode lock too often in close
succession.
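A simplified sketch of that pacing (my own simplification, not the actual DatanodeAdminManager loop): nodes are taken from the head of the queue until the running block count crosses the per-interval limit, then processing stops until the next 30-second interval.

```java
import java.util.List;

public class DecommThrottleSketch {
    // Mirrors dfs.namenode.decommission.blocks.per.interval (default 500000).
    static final int BLOCKS_PER_INTERVAL = 500_000;

    // Returns how many nodes from the head of the queue fit into one interval.
    // A single node with more than 500K blocks consumes the whole interval by
    // itself, which is why large nodes are processed one at a time.
    static int nodesProcessedThisInterval(List<Integer> blockCountPerNode) {
        int blocks = 0;
        int nodes = 0;
        for (int count : blockCountPerNode) {
            blocks += count;
            nodes++;
            if (blocks >= BLOCKS_PER_INTERVAL) {
                break; // limit reached; remaining nodes wait for the next interval
            }
        }
        return nodes;
    }
}
```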
When processing a node, we really need to hold a lock while processing some
unit of work. The work in the datanode is split across its storage volumes, and
we use an iterator to walk all the blocks on each storage. If you drop the
lock part way through that iteration, then a block report or file modification
in HDFS can change the contents of the storage, and the iterator will throw a
ConcurrentModificationException. Therefore interleaving many DNs for
processing at the same time is tricky: each one needs an exclusive lock and
they will all be contending for it. If we drop and re-take the lock for each
block, we will need to bookmark the iterator and handle
ConcurrentModificationException, possibly frequently.
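The hazard of dropping the lock mid-iteration is just Java's standard fail-fast iterator behaviour. A self-contained illustration, with a plain ArrayList standing in for a storage's block list:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class FailFastSketch {
    // Returns true if modifying the list mid-iteration (as a block report
    // could modify a storage while the lock is dropped) trips the iterator.
    static boolean modificationDetected() {
        List<String> blocks = new ArrayList<>(List.of("blk_1", "blk_2", "blk_3"));
        Iterator<String> it = blocks.iterator();
        it.next();           // partially consume the iterator
        blocks.add("blk_4"); // simulate a concurrent change while the "lock" is released
        try {
            it.next();       // fail-fast iterator detects the structural change
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```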
There is also no guarantee a user would not decommission node 1, then 10
minutes later decommission node 2, and so on; the suggested strategy would not
help with that.
I still believe the simplest fix to this issue is to change the implementation
of the pending replication queue to process it in a random order rather than
FIFO. That alone does not deal with nodes which have had some blocks skipped on
the first pass and need to be processed a second time, but we may be able to
solve that by retrying them a few times, as you also suggested.
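The random-order-plus-retry idea could look roughly like this (a sketch under my own naming, not the actual queue; the retry budget of 3 is an assumption):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomPendingQueueSketch {
    record Pending(String block, int attempts) {}

    static final int MAX_RETRIES = 3; // assumed retry budget for skipped blocks

    final List<Pending> queue = new ArrayList<>();
    final Random rand;

    RandomPendingQueueSketch(long seed) { this.rand = new Random(seed); }

    void add(String block) { queue.add(new Pending(block, 0)); }

    // Pick a random entry instead of the FIFO head, so replication work is
    // spread across volumes/nodes rather than draining them in order.
    Pending pollRandom() {
        if (queue.isEmpty()) return null;
        return queue.remove(rand.nextInt(queue.size()));
    }

    // A skipped block is re-queued a few times before being given up on,
    // covering nodes that need a second pass.
    void requeueIfRetriesLeft(Pending p) {
        if (p.attempts() + 1 < MAX_RETRIES) {
            queue.add(new Pending(p.block(), p.attempts() + 1));
        }
    }
}
```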
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam a single volume until it is cleaned out, at which point they all
> slam the next volume, and so on.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)