[
https://issues.apache.org/jira/browse/HDFS-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HDFS-2472 started by Steve Loughran.
> Extend UnderReplicatedBlocks queue to give blocks whose existing copies are
> all on a single rack priority over multi-rack blocks
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-2472
> URL: https://issues.apache.org/jira/browse/HDFS-2472
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
>
> Google's Availability in Globally Distributed Storage Systems paper
> [http://research.google.com/pubs/archive/36737.pdf] shows that storage node
> failures are often correlated with other failures on the same rack. This
> could be due to rack-local failures: switches, power, etc, or operations
> actions (take down a whole rack for a rolling OS upgrade). Whatever the root
> cause, the paper argues (section 5.2) that rack-aware placement would
> increase the MTTF of a single block by a factor of three (typically). Some
> decisions can be made a block placement time, but that would be a separate
> issue.
> Here I propose giving priority to blocks that are under replicated and where
> all blocks are on the same rack above those blocks that are under-replicated
> and the remaining blocks are on different racks.
> # Provided the failure does not take down the entire rack in one go (e.g.
> switch failure), this policy would decrease the time in which all blocks
> would be on a single rack, so reduce the consequences of a rack failure.
> # On a single-rack system, all under-replicated blocks would go into this
> queue, so the state would effectively be that of today's system.
> This may make the demand for off-rack bandwidth ramp up immediately, because
> priority will be given to blocks that must be replicated off rack. I am not
> sure that it will, however, as multi-rack replication policies would generate
> the same amount of traffic to re-replicate the blocks anyway.
> The main barrier to implementing this feature is that currently the
> UnderReplicatedBlocks queues do not get provided with any information about
> block locality. We'd need to change the add() method to take information
> about where the current replicas are, and the BlockManager would have to
> provide some information as to whether the block was rack-local, not-rack
> local or unknown; the latter because the PendingReplicationBlocks structure
> does not know where things come from. "unknown" items would have to be given
> the same priority as rack-only blocks (pessimistic) or the same priority as
> multi-rack blocks (optimistic). I would be biased towards the pessimistic
> approach as it would ensure that on single-rack systems there would be no
> obvious change in behaviour. On a multi-rack system it would give priority to
> timed out PendingReplication blocks ahead of multi-rack under-replicated
> blocks. That may be a good thing in itself
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira