Extend UnderReplicatedBlocks queue to give blocks whose existing copies are all
on a single rack priority over multi-rack blocks
--------------------------------------------------------------------------------------------------------------------------------
Key: HDFS-2472
URL: https://issues.apache.org/jira/browse/HDFS-2472
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Minor
Google's Availability in Globally Distributed Storage Systems paper
[http://research.google.com/pubs/archive/36737.pdf] shows that storage node
failures are often correlated with other failures on the same rack. This could
be due to rack-local failures: switches, power, etc, or operations actions
(take down a whole rack for a rolling OS upgrade). Whatever the root cause, the
paper argues (section 5.2) that rack-aware placement would increase the MTTF of
a single block by a factor of three (typically). Some decisions can be made a
block placement time, but that would be a separate issue.
Here I propose giving priority to blocks that are under replicated and where
all blocks are on the same rack above those blocks that are under-replicated
and the remaining blocks are on different racks.
# Provided the failure does not take down the entire rack in one go (e.g.
switch failure), this policy would decrease the time in which all blocks would
be on a single rack, so reduce the consequences of a rack failure.
# On a single-rack system, all under-replicated blocks would go into this
queue, so the state would effectively be that of today's system.
This may make the demand for off-rack bandwidth ramp up immediately, because
priority will be given to blocks that must be replicated off rack. I am not
sure that it will, however, as multi-rack replication policies would generate
the same amount of traffic to re-replicate the blocks anyway.
The main barrier to implementing this feature is that currently the
UnderReplicatedBlocks queues do not get provided with any information about
block locality. We'd need to change the add() method to take information about
where the current replicas are, and the BlockManager would have to provide some
information as to whether the block was rack-local, not-rack local or unknown;
the latter because the PendingReplicationBlocks structure does not know where
things come from. "unknown" items would have to be given the same priority as
rack-only blocks (pessimistic) or the same priority as multi-rack blocks
(optimistic). I would be biased towards the pessimistic approach as it would
ensure that on single-rack systems there would be no obvious change in
behaviour. On a multi-rack system it would give priority to timed out
PendingReplication blocks ahead of multi-rack under-replicated blocks. That may
be a good thing in itself
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira