[ 
https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947887#comment-16947887
 ] 

Stephen O'Donnell commented on HDFS-14861:
------------------------------------------

I have uploaded a first go at this change.

This adds a new config key 
"DFSConfigKeys.DFS_NAMENODE_REDUNDANCY_QUEUE_RESET_ITERATIONS_DEFAULT" which 
defaults to 2400. As the redundancy monitor runs every 3 seconds by default, it 
will take about 2400 * 3 = 7200 seconds (2 hours) to reset the iterators. 
Setting the key to zero disables this change, restoring the previous behaviour 
of resetting the iterators only after all the queues have reached their end.

LowRedundancyBlocks does not currently get a conf object passed in, so I opted 
to pass the value into its constructor and read the setting from the config in 
BlockManager, which creates the LowRedundancyBlocks object. It's not an ideal 
way to do this, and the value is only used in chooseLowRedundancyBlocks and 
nowhere else in the class, so it's a little confusing.

I still need to add some tests, but I wanted to share what I had for now.
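To make the iteration arithmetic concrete, here is a minimal, hypothetical sketch of the reset counter described above. The class and method names are illustrative only and are not taken from the actual patch; the only details carried over from the comment are the 3-second monitor interval, the 2400-iteration default, and the zero-disables-the-feature semantics.

```java
// Hypothetical sketch of the periodic iterator reset; names are illustrative,
// not the identifiers used in the real patch.
public class RedundancyQueueResetSketch {
    // Assumed defaults from the comment: the redundancy monitor runs every
    // 3 seconds, and iterators reset after 2400 calls (0 disables the reset).
    static final int REDUNDANCY_MONITOR_INTERVAL_SECONDS = 3;
    static final int QUEUE_RESET_ITERATIONS_DEFAULT = 2400;

    private final int resetIterations;
    private int callsSinceReset = 0;

    RedundancyQueueResetSketch(int resetIterations) {
        this.resetIterations = resetIterations;
    }

    /** Returns true when the bookmarked iterators should be reset. */
    boolean shouldResetIterators() {
        if (resetIterations <= 0) {
            return false; // feature disabled: behave exactly as before
        }
        callsSinceReset++;
        if (callsSinceReset >= resetIterations) {
            callsSinceReset = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 2400 iterations * 3 seconds = 7200 seconds = 2 hours between resets.
        int seconds =
            QUEUE_RESET_ITERATIONS_DEFAULT * REDUNDANCY_MONITOR_INTERVAL_SECONDS;
        System.out.println(seconds / 3600); // hours between resets
    }
}
```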

> Reset LowRedundancyBlocks Iterator periodically
> -----------------------------------------------
>
>                 Key: HDFS-14861
>                 URL: https://issues.apache.org/jira/browse/HDFS-14861
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> When the namenode needs to schedule blocks for reconstruction, the blocks are 
> placed into the neededReconstruction object in the BlockManager. This is an 
> instance of LowRedundancyBlocks, which maintains a list of priority queues 
> where the blocks are held until they are scheduled for reconstruction / 
> replication.
> Every 3 seconds, by default, a number of blocks are retrieved from 
> LowRedundancyBlocks. The method 
> LowRedundancyBlocks.chooseLowRedundancyBlocks() is used to retrieve the next 
> set of blocks using a bookmarked iterator. Each call to this method moves the 
> iterator forward. The number of blocks retrieved is governed by the formula:
> number_of_live_nodes * dfs.namenode.replication.work.multiplier.per.iteration 
> (default 2)
> Then the namenode attempts to schedule those blocks on datanodes, but each 
> datanode has a limit on how many blocks can be queued against it (controlled 
> by dfs.namenode.replication.max-streams), so not all of the retrieved blocks 
> may be scheduled. There may also be other reasons, such as block 
> availability, why some blocks are not scheduled.
> As the iterator in chooseLowRedundancyBlocks() always moves forward, the 
> blocks which were not scheduled are not retried until the end of the queue is 
> reached and the iterator is reset.
> If the replication queue is very large (e.g. several nodes are being 
> decommissioned) or if blocks are being continuously added to the replication 
> queue (e.g. nodes decommissioning using the proposal in HDFS-14854), it may 
> take a very long time for the iterator to be reset to the start.
> As a result, a few blocks belonging to a node that is decommissioning or 
> entering maintenance mode could get left behind, taking many hours or even 
> days to be retried, which could prevent decommission from completing.
> With this Jira, I would like to suggest we reset the iterator after a 
> configurable number of calls to chooseLowRedundancyBlocks() so that any 
> left-behind blocks are retried.
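The per-pass retrieval count from the description above works out as sketched below. This is a hedged illustration; the class and method names are hypothetical, and the only facts taken from the description are the formula itself and the default multiplier of 2 for dfs.namenode.replication.work.multiplier.per.iteration.

```java
// Illustrative arithmetic for how many blocks the redundancy monitor requests
// per pass; names are hypothetical, not the BlockManager's actual fields.
public class ReplicationWorkSketch {
    static int blocksToProcess(int liveDatanodes, int workMultiplier) {
        // number_of_live_nodes *
        //     dfs.namenode.replication.work.multiplier.per.iteration
        return liveDatanodes * workMultiplier;
    }

    public static void main(String[] args) {
        // e.g. 100 live datanodes with the default multiplier of 2
        System.out.println(blocksToProcess(100, 2)); // 200 blocks per pass
    }
}
```

Under these assumptions, a cluster that cannot schedule all 200 blocks (for example because datanodes hit the dfs.namenode.replication.max-streams cap) would not revisit the skipped blocks until the bookmarked iterator wraps around, which is the behaviour this Jira addresses.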



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
