[ https://issues.apache.org/jira/browse/HDFS-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511433#comment-13511433 ]
Daryn Sharp commented on HDFS-4270:
-----------------------------------
I don't fully understand this subsystem, but I'm a bit torn over limiting the
"only 1 block left" replications. This _should_ be a rare event, but when it
does occur it's a critical situation. I'm unclear whether the max-replication
limit is compared against in-flight or queued replications. If the latter,
perhaps higher-priority blocks should displace already-queued blocks for that
DN? If the "only 1 block left" replications are subjected to a new hard limit,
is there an issue with how quickly the monitor will cycle back to schedule the
critical blocks?
Based on an actual incident: we lost most of a rack, then happened to lose the
third DN before replication occurred. A lot of nodes were being
decommissioned, which appears to have delayed replication after the first DN
was lost and again after the second DN on the rack was lost. The third DN's
disk holding the remaining replica died hours later, and the node was
decommissioned with no notification that the block had been lost. There may be
more bugs involved, but this seemed like an obvious fix to mitigate the risk.
> Replications of the highest priority should be allowed to choose a source
> datanode that has reached its max replication limit
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-4270
> URL: https://issues.apache.org/jira/browse/HDFS-4270
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.0.0, 0.23.5
> Reporter: Derek Dagit
> Assignee: Derek Dagit
> Priority: Minor
> Attachments: HDFS-4270-branch-0.23.patch, HDFS-4270.patch
>
>
> Blocks that have been identified as under-replicated are placed on one of
> several priority queues. The highest priority queue is essentially reserved
> for situations in which only one replica of the block exists, meaning it
> should be replicated ASAP.
> The ReplicationMonitor periodically computes replication work, and a call to
> BlockManager#chooseUnderReplicatedBlocks selects a given number of
> under-replicated blocks, choosing blocks from the highest-priority queue
> first and working down to the lowest priority queue.
> In the subsequent call to BlockManager#computeReplicationWorkForBlocks, a
> source for the replication is chosen from among datanodes that have an
> available copy of the block needed. This is done in
> BlockManager#chooseSourceDatanode.
> chooseSourceDatanode selects a random datanode from among those holding a
> replica that have not reached their replication limit (preferring datanodes
> that are currently decommissioning).
> However, the priority queue of the block does not inform the logic. If a
> datanode holds the last remaining replica of a block and has already reached
> its replication limit, the node is dismissed outright and the replication is
> not scheduled.
> In some situations, this could lead to data loss, as the last remaining
> replica could disappear if an opportunity is not taken to schedule a
> replication. It would be better to waive the max replication limit in cases
> of highest-priority block replication.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira