[
https://issues.apache.org/jira/browse/HDFS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052086#comment-13052086
]
Haryadi Gunawi commented on HDFS-1765:
--------------------------------------
I agree with Hairong. Recently, I've been playing around with this, and found
the same problem as shown in the attachment (underReplicatedQueue.pdf).
At a high level, if the round-robin iterator is in queue-2 (the queue with
priority=2), then the UR blocks in queue-0 must wait until the iterator wraps
around to queue-0 again. So I assume that, in the worst case, if queue-2 is long
(as depicted in the graph), the UR blocks in queue-0 will take a very long time
to be served!
The setup of the figure:
I have 20 nodes. Each node holds 3000 blocks. I fail 4 nodes.
q-0: UR blocks with 1 replica
q-2: UR blocks with 2 replicas
pq: pending queue
(I stopped the experiment in the middle, because the pattern is obvious)
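To make the wrap-around effect concrete, here is a toy sketch of the queue
structure, a self-contained simulation with hypothetical names
(RoundRobinStarvationDemo, etc.), not the actual NameNode classes. A single
cursor resumes from wherever it left off, so a block added to q-0 is only
served after the rest of q-2 drains:
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Toy model (hypothetical names, not the real HDFS classes) of prioritized
 * under-replicated block queues served by one round-robin cursor that
 * resumes from where it left off instead of restarting at q-0.
 */
public class RoundRobinStarvationDemo {
    static final int LEVELS = 3;                      // q-0 (highest) .. q-2 (lowest)
    final List<Queue<String>> queues = new ArrayList<>();
    int cursor = 0;                                   // queue the monitor is currently draining

    RoundRobinStarvationDemo() {
        for (int i = 0; i < LEVELS; i++) {
            queues.add(new ArrayDeque<String>());
        }
    }

    void add(int level, String block) {
        queues.get(level).add(block);
    }

    /** Return the next block to replicate, scanning from the cursor, not from q-0. */
    String next() {
        for (int i = 0; i < LEVELS; i++) {
            int level = (cursor + i) % LEVELS;
            Queue<String> q = queues.get(level);
            if (!q.isEmpty()) {
                cursor = level;                       // remember position across calls
                return q.poll();
            }
        }
        return null;                                  // all queues empty
    }

    public static void main(String[] args) {
        RoundRobinStarvationDemo d = new RoundRobinStarvationDemo();
        for (int i = 0; i < 5; i++) {
            d.add(2, "low-" + i);                     // a long q-2
        }
        d.next();                                     // cursor is now parked on q-2
        d.add(0, "urgent");                           // a new 1-replica block lands in q-0
        String b;
        while ((b = d.next()) != null) {
            System.out.println(b);                    // low-1..low-4 print before "urgent"
        }
    }
}
{code}
Running it prints low-1 through low-4 before "urgent", even though "urgent"
sits in the highest-priority queue.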
More details on why the round-robin iterator does not work:
- choose a block B to be replicated
- pick a source node S that still holds B
- BUT if S has already been chosen to replicate other blocks
  (i.e. S's replication stream count is already at maxReplicationStreams (2)),
  then increment the iterator (and thus block B in queue-0
  will not be served until the round-robin iterator wraps around).
And if the other queues (e.g. q-1 and q-2) are very long, then queue-0
might be starved for a long time.
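To illustrate that skip-and-advance behavior, here is a second minimal sketch,
again with hypothetical names and a flattened scan order rather than the real
BlockManager/FSNamesystem logic: the lone q-0 block is skipped because its only
source is already at the stream limit, and the pass fills its budget from q-2
instead:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not the actual NameNode code) of one scheduling pass:
 * a q-0 block whose only remaining source is already at the stream limit is
 * skipped, the shared cursor moves on, and lower-priority q-2 blocks are
 * scheduled instead.
 */
public class ReplicationSkipDemo {
    static final int MAX_REPL_STREAMS = 2;            // analogous to maxReplicationStreams

    static class Block {
        final String id;
        final String source;                          // only datanode still holding the block
        Block(String id, String source) { this.id = id; this.source = source; }
    }

    public static void main(String[] args) {
        // Replications already scheduled per datanode in this round.
        Map<String, Integer> streams = new HashMap<>();
        streams.put("dnA", 2);                        // dnA is already saturated

        // Flattened round-robin scan order: the lone q-0 block first, then a long q-2.
        Deque<Block> scanOrder = new ArrayDeque<>();
        scanOrder.add(new Block("q0-block", "dnA"));
        for (int i = 0; i < 4; i++) {
            scanOrder.add(new Block("q2-block-" + i, "dnB"));
        }

        int scheduled = 0;
        while (!scanOrder.isEmpty() && scheduled < 2) {
            Block b = scanOrder.poll();               // the cursor advances past this block
            int inFlight = streams.getOrDefault(b.source, 0);
            if (inFlight >= MAX_REPL_STREAMS) {
                // Skipped: not looked at again until the cursor wraps around.
                System.out.println("skip " + b.id + " (source " + b.source + " saturated)");
                continue;
            }
            streams.put(b.source, inFlight + 1);
            scheduled++;
            System.out.println("schedule " + b.id + " from " + b.source);
        }
    }
}
{code}
The q-0 block then has to wait for the cursor to come all the way back around,
which is exactly the starvation pattern described above.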
> Block Replication should respect under-replication block priority
> -----------------------------------------------------------------
>
> Key: HDFS-1765
> URL: https://issues.apache.org/jira/browse/HDFS-1765
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Affects Versions: 0.23.0
> Reporter: Hairong Kuang
> Assignee: Hairong Kuang
> Fix For: 0.23.0
>
>
> Currently, under-replicated blocks are assigned different priorities depending
> on how many replicas a block has. However, the replication monitor works on
> blocks in a round-robin fashion, so newly added high-priority blocks won't get
> replicated until all low-priority blocks are done. One example: on the
> decommissioning datanode WebUI we often observe that "blocks with only
> decommissioning replicas" do not get scheduled for replication before other
> blocks, risking data availability if the node is shut down for repair before
> decommissioning completes.