[ 
https://issues.apache.org/jira/browse/HDFS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052086#comment-13052086
 ] 

Haryadi Gunawi commented on HDFS-1765:
--------------------------------------

I agree with Hairong. I've been experimenting with this recently and found 
the same problem, as shown in the attachment (underReplicatedQueue.pdf).

At a high level, if the round-robin iterator is in queue-2 (the queue with 
priority=2), then the under-replicated (UR) blocks in queue-0 must wait until 
the iterator wraps around to queue-0 again.  So I assume that, in the worst 
case, if queue-2 is long (as depicted in the graph), the UR blocks in queue-0 
will take a very long time to be served!

The setup of the figure:
I have 20 nodes.  Each node holds 3000 blocks. I fail 4 nodes.
q-0: UR blocks with 1 replica
q-2: UR blocks with 2 replicas
pq: pending queue
(I stopped the experiment in the middle, because the pattern is obvious)
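
A rough back-of-the-envelope estimate of the initial queue sizes for this
setup can be sketched as below. It is not taken from the experiment: it
assumes a replication factor of 3, replicas placed uniformly at random on
distinct nodes, and that "3000 blocks" per node means 3000 replicas. The
names (REPL, expected_blocks_losing, ...) are made up for illustration.

```python
from math import comb

NODES = 20                 # from the setup above
FAILED = 4                 # from the setup above
REPLICAS_PER_NODE = 3000   # assumption: "3000 blocks" counts replicas
REPL = 3                   # assumption: replication factor 3

TOTAL_BLOCKS = NODES * REPLICAS_PER_NODE // REPL  # 20000 distinct blocks


def expected_blocks_losing(k):
    """Expected number of blocks that lose exactly k of their REPL replicas.

    Hypergeometric: k of a block's replicas sit on the FAILED nodes and the
    remaining REPL - k sit on the surviving nodes.
    """
    p = comb(FAILED, k) * comb(NODES - FAILED, REPL - k) / comb(NODES, REPL)
    return TOTAL_BLOCKS * p


if __name__ == "__main__":
    print("q-2 (2 replicas left):", round(expected_blocks_losing(1)))
    print("q-0 (1 replica left): ", round(expected_blocks_losing(2)))
    print("all replicas lost:    ", round(expected_blocks_losing(3)))
```

Under these assumptions q-2 starts out about five times longer than q-0,
which is the kind of imbalance that makes a wrap of the iterator expensive.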

More details on why the round-robin iterator does not work:

It is true that the round-robin iterator goes through queue-0 first,
but the replication monitor runs this logic:
- choose a block B to be replicated
- pick a source node S that still has B 
- BUT if S has already been chosen to replicate other blocks 
  (i.e. S's replication stream count is already larger than maxrepstream (2)),
  then skip B and increment the iterator (and thus this block B in queue-0
  will not be served until the round-robin iterator wraps around).
  And if the other queues (e.g. queue-1 and queue-2) are very long, then 
  queue-0 might be starved for a long time.
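
The logic above can be sketched as a toy simulation. This is not the
NameNode code, only a minimal model of the behavior described: the function
and parameter names (simulate, blocks_per_round, MAX_STREAMS, the node names)
are invented for illustration, and it assumes that every replication stream
finishes within one round.

```python
from collections import defaultdict

MAX_STREAMS = 2  # per-node cap on replication streams (the "maxrepstream" above)


def simulate(queues, blocks_per_round, round_robin=True):
    """Return {block_id: round in which the block was scheduled}.

    queues: list indexed by priority; each entry is a list of
            (block_id, [source nodes that still hold the block]).
    """
    # Flatten in priority order, which is the order the iterator walks.
    pending = [blk for q in queues for blk in q]
    scheduled_at = {}
    cursor = 0
    rnd = 0
    while pending:
        rnd += 1
        streams = defaultdict(int)  # simplification: streams finish every round
        if not round_robin:
            cursor = 0  # strict priority: rescan from the head each round
        for _ in range(blocks_per_round):
            if not pending:
                break
            if cursor >= len(pending):
                cursor = 0  # the iterator wraps around
            block_id, sources = pending[cursor]
            free = [s for s in sources if streams[s] < MAX_STREAMS]
            if free:
                streams[free[0]] += 1
                scheduled_at[block_id] = rnd
                pending.pop(cursor)  # block leaves the UR queues
            else:
                cursor += 1  # skipped: revisited only after a wrap
    return scheduled_at


def last_queue0_round(round_robin):
    # Six queue-0 blocks share one surviving source "A"; queue-2 is long.
    q0 = [(f"b{i}", ["A"]) for i in range(6)]
    q2 = [(f"c{i}", [f"n{i % 4}", f"n{(i + 1) % 4}"]) for i in range(40)]
    result = simulate([q0, [], q2], blocks_per_round=4, round_robin=round_robin)
    return max(result[f"b{i}"] for i in range(6))


if __name__ == "__main__":
    print("round-robin    :", last_queue0_round(True))
    print("strict priority:", last_queue0_round(False))
```

With this toy input, the round-robin variant schedules the last queue-0 block
in round 12, after the iterator has walked all of queue-2, while rescanning
from the head every round finishes queue-0 in round 3.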



> Block Replication should respect under-replication block priority
> -----------------------------------------------------------------
>
>                 Key: HDFS-1765
>                 URL: https://issues.apache.org/jira/browse/HDFS-1765
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.23.0
>
>
> Currently under-replicated blocks are assigned different priorities depending 
> on how many replicas a block has. However the replication monitor works on 
> blocks in a round-robin fashion. So the newly added high priority blocks 
> won't get replicated until all low-priority blocks are done. One example is 
> that on decommissioning datanode WebUI we often observe that "blocks with 
> only decommissioning replicas" do not get scheduled to replicate before other 
> blocks, so risking data availability if the node is shutdown for repair 
> before decommission completes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
