[
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921331#comment-16921331
]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------
{quote}
# Add a configuration, which makes NN to release the lock every
10000(configurable) blocks.
{quote}
There was some discussion related to this in HDFS-10477, where it was decided
to drop the lock after processing each storage. The reason is that the iterator
for the storage could throw a ConcurrentModificationException if its contents
change while the lock is dropped and retaken. Locking at the storage level is
probably a good middle ground between how it works currently and locking on a
block count threshold.
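To make that concrete, here is a minimal, self-contained sketch of the
per-storage locking pattern. All names here (StorageChunkedProcessor,
schedule) are illustrative stand-ins, not the actual NameNode classes:
{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: each storage's contents are modeled as a list of
// block IDs; these are stand-ins, not the real NameNode classes.
class StorageChunkedProcessor {
  private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

  void processStorages(List<List<String>> storages) {
    for (List<String> storageBlocks : storages) {
      nsLock.writeLock().lock();
      try {
        // The whole iteration over one storage happens under the lock,
        // so its iterator can never observe a concurrent modification.
        for (String block : storageBlocks) {
          schedule(block);
        }
      } finally {
        nsLock.writeLock().unlock();
      }
      // The lock is dropped between storages, letting other NameNode
      // operations run: the middle ground described above.
    }
  }

  private void schedule(String block) {
    // Replication-scheduling stub.
  }
}
{code}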
Thinking about the problem of replicating older blocks first ... We currently
have several replication queues, and blocks with only 1 replica should go into
the highest priority queue. That means other blocks (only 2 replicas) and
decommissioning blocks are in the 'normal' queue. Looking at how that queue is
currently processed, it begins at the start and (see the sketch after this
list):
# Gets 2 * live_nodes blocks
# Attempts to schedule them for replication based on max-streams limits
# Any that are not scheduled are simply skipped until all other blocks have
been tried and the iterator cycles round.
Therefore, even in the current implementation, some blocks can get left
behind for some time.
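A rough, self-contained model of one such pass, to make the skip-and-cycle
behaviour concrete. All names here (ReplicationPass, trySchedule) are
simplified stand-ins for the BlockManager logic, and the max-streams check is
reduced to a per-pass cap rather than the real per-node limit:
{code:java}
import java.util.Deque;

// Rough model of one pass over the 'normal' replication queue; these
// names are simplified stand-ins, not the real BlockManager code.
class ReplicationPass {
  void runOnePass(Deque<String> normalQueue, int liveNodes, int maxStreams) {
    int blocksToProcess = 2 * liveNodes;             // step 1: batch size
    int scheduled = 0;
    for (int i = 0; i < blocksToProcess && !normalQueue.isEmpty(); i++) {
      String block = normalQueue.pollFirst();
      if (scheduled < maxStreams && trySchedule(block)) {
        scheduled++;                                 // step 2: scheduled
      } else {
        // step 3: pushed to the back of the queue, so it is not retried
        // until the iterator has cycled through everything else.
        normalQueue.addLast(block);
      }
    }
  }

  private boolean trySchedule(String block) {
    return true; // placement stub: always succeeds in this sketch
  }
}
{code}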
This does seem to be a tricky problem to get correct, as there are quite a few
edge cases and scenarios to consider.
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam on a single volume until it is cleaned out, at which point they all
> slam on the next volume, and so on.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.
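For concreteness, a minimal sketch of the randomization the description asks
for, assuming the pending replication list can be flattened and shuffled
before scheduling (names are illustrative, not the attached patch):
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the requested change: shuffle the flat pending-block list so
// replication work spreads across all volumes instead of draining one
// volume at a time. Illustrative only, not the attached patch.
class DecommissionOrder {
  static void randomize(List<String> pendingBlocks, long seed) {
    Collections.shuffle(pendingBlocks, new Random(seed));
  }
}
{code}
Round-robin interleaving of per-volume block lists would give a similar spread
with a deterministic order.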