[ 
https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992724#comment-16992724
 ] 

Stephen O'Donnell commented on HDFS-14861:
------------------------------------------

I discussed this issue with Wei-Chiu offline and we had a couple of concerns:

1. If we are resetting the iterator periodically, and there are a lot of 
missing blocks, how will that affect things?

2. Is there any better way to detect when the iterator needs a reset?

For (1) - missing / corrupt blocks go into the lowest priority queue which is 
not accessed by the iterators in question here. The iterators are used in 
chooseLowRedundancyBlocks:

{code}
  synchronized List<List<BlockInfo>> chooseLowRedundancyBlocks(
      int blocksToProcess) {
    final List<List<BlockInfo>> blocksToReconstruct = new ArrayList<>(LEVEL);

    int count = 0;
    int priority = 0;
    for (; count < blocksToProcess && priority < LEVEL; priority++) {
      if (priority == QUEUE_WITH_CORRUPT_BLOCKS) {
        // do not choose corrupted blocks.
        continue;
      }

      // Go through all blocks that need reconstructions with current priority.
      // Set the iterator to the first unprocessed block at this priority level
      final Iterator<BlockInfo> i = priorityQueues.get(priority).getBookmark();
      ...
{code}

The corrupt / missing blocks will all be in QUEUE_WITH_CORRUPT_BLOCKS and hence 
are not processed by this method. Therefore we don't need to worry about them 
with this change.

For (2) - it is difficult to come up with anything other than a time-based 
metric. The reason is that each queue is effectively a doubly linked list, and 
the iterator bookmark just points to the next element. Given that element, we 
have no knowledge of how many blocks are behind that point or ahead of it. 
Ideally we would reset the iterator once some threshold of blocks has 
accumulated behind the pointer, as those are the blocks which got skipped for 
some reason. However, the only way to count the blocks behind the bookmark is 
to read the list from the start until you encounter the element the iterator 
would return next, which would not be very efficient. The easy solution is to 
simply reset the iterator to the start after some amount of time, but it's hard 
to know what the best period would be.
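To make the proposal concrete, here is a minimal sketch of the reset-after-N-calls idea, using a plain list with an index as the bookmark. This is illustrative only - the class and field names (BookmarkedQueue, callsBeforeReset) are invented, and the real LowRedundancyBlocks uses a LightWeightLinkedSet bookmark iterator rather than an index:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of a queue with a bookmark that resets to the head after a
// configurable number of calls, so previously skipped items get retried.
// Not the actual HDFS implementation; names are hypothetical.
class BookmarkedQueue<T> {
  private final ArrayList<T> list = new ArrayList<>();
  private final int callsBeforeReset;  // analogous to the proposed config knob
  private int bookmark = 0;            // index of the next unprocessed element
  private int calls = 0;

  BookmarkedQueue(int callsBeforeReset) {
    this.callsBeforeReset = callsBeforeReset;
  }

  void add(T item) {
    list.add(item);
  }

  // Return up to n items starting at the bookmark. The bookmark resets to
  // the head either when the end of the queue is reached (existing
  // behaviour) or after callsBeforeReset invocations (proposed behaviour).
  List<T> next(int n) {
    if (++calls > callsBeforeReset || bookmark >= list.size()) {
      bookmark = 0;  // periodic reset: skipped items become visible again
      calls = 1;
    }
    List<T> out = new ArrayList<>();
    while (out.size() < n && bookmark < list.size()) {
      out.add(list.get(bookmark++));
    }
    return out;
  }
}
{code}

With callsBeforeReset=2 and items a,b,c,d, successive next(2) calls return [a,b], [c,d], then [a,b] again after the reset, which is exactly the retry behaviour the patch aims for: any block that was skipped on an earlier pass gets another chance without waiting for the iterator to walk the whole queue.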

It may be useful to create a command to dump the contents of 
lowRedundancyBlocks in a separate Jira. That could give some further insight 
into the queues, especially when decommission appears stuck for no reason, and 
would also let us see how often this problem occurs on a real cluster.

> Reset LowRedundancyBlocks Iterator periodically
> -----------------------------------------------
>
>                 Key: HDFS-14861
>                 URL: https://issues.apache.org/jira/browse/HDFS-14861
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: decommission
>         Attachments: HDFS-14861.001.patch, HDFS-14861.002.patch
>
>
> When the namenode needs to schedule blocks for reconstruction, the blocks are 
> placed into the neededReconstruction object in the BlockManager. This is an 
> instance of LowRedundancyBlocks, which maintains a list of priority queues 
> where the blocks are held until they are scheduled for reconstruction / 
> replication.
> Every 3 seconds, by default, a number of blocks are retrieved from 
> LowRedundancyBlocks. The method 
> LowRedundancyBlocks.chooseLowRedundancyBlocks() is used to retrieve the next 
> set of blocks using a bookmarked iterator. Each call to this method moves the 
> iterator forward. The number of blocks retrieved is governed by the formula:
> number_of_live_nodes * dfs.namenode.replication.work.multiplier.per.iteration 
> (default 2)
> Then the namenode attempts to schedule those blocks on datanodes, but each 
> datanode has a limit of how many blocks can be queued against it (controlled 
> by dfs.namenode.replication.max-streams) so all of the retrieved blocks may 
> not be scheduled. There may be other block availability reasons the blocks 
> are not scheduled too.
> As the iterator in chooseLowRedundancyBlocks() always moves forward, the 
> blocks which were not scheduled are not retried until the end of the queue is 
> reached and the iterator is reset.
> If the replication queue is very large (eg several nodes are being 
> decommissioned) or if blocks are being continuously added to the replication 
> queue (eg nodes decommission using the proposal in HDFS-14854) it may take a 
> very long time for the iterator to be reset to the start.
> The result of this could be that a few blocks for a decommissioning or 
> entering-maintenance node get left behind, taking many hours or even days to 
> be retried, and this could stop decommission completing.
> With this Jira, I would like to suggest we reset the iterator after a 
> configurable number of calls to chooseLowRedundancyBlocks() so any left 
> behind blocks are retried.
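The batch-size formula quoted in the description can be checked with a quick calculation. The method and node count below are illustrative; the default multiplier of 2 for dfs.namenode.replication.work.multiplier.per.iteration is taken from the description:

{code:java}
public class WorkCalc {
  // blocksToProcess = live nodes * work multiplier, per the description above
  static int blocksToProcess(int liveNodes, int workMultiplier) {
    return liveNodes * workMultiplier;
  }

  public static void main(String[] args) {
    // e.g. a hypothetical 100-node cluster with the default multiplier of 2
    System.out.println(blocksToProcess(100, 2));  // 200 blocks per 3s pass
  }
}
{code}

On such a cluster the iterator advances by 200 blocks every 3 seconds, so with millions of queued blocks (several nodes decommissioning at once), a full pass back to any skipped block can take hours, which is the delay the reset is meant to bound.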


