[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921211#comment-16921211 ]

Chen Zhang commented on HDFS-13157:
-----------------------------------

Thanks [~sodonnell] for your detailed analysis. Since the comments are very 
long, I'll try to outline the conclusions we've reached so far:
 # When decommissioning one DN, the blocks are added to the replication queue 
one disk after another.
 # When scheduling replication work, the redundancyMonitor gets blocks from the 
highest-priority queue one by one, so for each DN the scheduled blocks 
concentrate on a single disk (see the sketch after this list).
 # In some cases, decommissioning may effectively progress on only one DN at a 
time (e.g. the first liveNodes*2 blocks all come from one DN, and those blocks 
have no overlap with the other decommissioning nodes).
 ## *My thoughts here*: actually, in most cases all DNs will make 
decommissioning progress at the same time, but the 1st node will finish much 
faster, which is also not what we want.
 # [~sodonnell] observed that the NN lock may be held for a very long time when 
processing a node for decommission.
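A minimal sketch of the ordering described in points 1 and 2. The class and method names here are illustrative only, not the actual DatanodeAdminManager/BlockManager code: enqueuing a DN's blocks storage by storage, then draining the queue in FIFO order, means each scheduling round pulls blocks from the same volume until it is exhausted.

{code:java}
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Illustrative sketch only; names do not match the real HDFS classes.
class DecommissionOrderingSketch {

  static class Block {
    final long id;
    Block(long id) { this.id = id; }
  }

  // Stands in for one DataNode disk/volume.
  static class StorageDir {
    final String name;
    final List<Block> blocks;
    StorageDir(String name, List<Block> blocks) {
      this.name = name;
      this.blocks = blocks;
    }
  }

  // Point 1: blocks are enqueued storage by storage, in order, so all of
  // disk 0's blocks precede all of disk 1's blocks, and so on.
  static Queue<Block> buildReplicationQueue(List<StorageDir> storages) {
    Queue<Block> neededReplications = new ArrayDeque<>();
    for (StorageDir dir : storages) {
      neededReplications.addAll(dir.blocks);
    }
    return neededReplications;
  }

  // Point 2: the redundancy monitor drains the queue in FIFO order, so the
  // blocks scheduled in one iteration all come from the same disk until that
  // disk's blocks run out.
  static void scheduleWork(Queue<Block> neededReplications, int maxPerIteration) {
    for (int i = 0; i < maxPerIteration && !neededReplications.isEmpty(); i++) {
      Block b = neededReplications.poll();
      // compute replication work for b ... every source read hits the same volume
    }
  }
}
{code}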

 

Actually, we've encountered all of these problems on our production cluster, so 
I'd like to share our solutions here (a rough sketch of both follows the list):
 # Add a new replication queue implementation that randomizes the order in 
which blocks are retrieved from it.
 # Add a configuration that makes the NN release the lock every 10000 
(configurable) blocks.
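A rough sketch of both changes, assuming hypothetical class and constant names rather than the actual patch: a queue that hands out blocks in randomized order, and a scan loop that drops and re-acquires the write lock every N (here 10000) processed blocks.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; class/config names are illustrative, not from the patch.
class DecommissionTuningSketch {

  // Solution 1: a replication queue that returns blocks in randomized order,
  // spreading scheduled work across disks instead of draining one volume at a time.
  static class RandomizedReplicationQueue<T> {
    private final List<T> blocks = new ArrayList<>();
    private final Random random = new Random();

    synchronized void add(T block) {
      blocks.add(block);
    }

    synchronized T poll() {
      if (blocks.isEmpty()) {
        return null;
      }
      // Swap-remove: O(1) removal of a uniformly random element.
      int i = random.nextInt(blocks.size());
      Collections.swap(blocks, i, blocks.size() - 1);
      return blocks.remove(blocks.size() - 1);
    }
  }

  // Solution 2: release the write lock every N blocks while scanning a
  // decommissioning node, so other operations are not blocked for the whole scan.
  static final int BLOCKS_PER_LOCK = 10_000;  // would be configurable

  static <T> void scanWithLockRelease(List<T> nodeBlocks,
                                      ReentrantReadWriteLock fsLock) {
    int processed = 0;
    fsLock.writeLock().lock();
    try {
      for (T block : nodeBlocks) {
        // check/queue the block for replication ...
        if (++processed % BLOCKS_PER_LOCK == 0) {
          // Yield the lock briefly so queued operations can proceed.
          fsLock.writeLock().unlock();
          fsLock.writeLock().lock();
        }
      }
    } finally {
      fsLock.writeLock().unlock();
    }
  }
}
{code}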

*I think randomizing the replication queue is better than randomizing the 
blockIterator,* because randomizing the blockIterator won't resolve issue 3 
mentioned above and makes the optimization for issue 4 harder.

*But randomizing the replication queue has another negative impact we should 
consider here*: normally the blocks at the head of the queue should be 
scheduled before the blocks at the tail, because the longer a block has been 
waiting for replication, the higher its probability of data loss, and 
randomizing the order makes that guarantee impossible.

> Do Not Remove Blocks Sequentially During Decommission 
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, and so on.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
