[
https://issues.apache.org/jira/browse/HDFS-14786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919931#comment-16919931
]
Mingliang Liu commented on HDFS-14786:
--------------------------------------
[~surendrasingh] For now in public cloud, there are not many regions which have
more than 4 AZs. In our use case, most if not all data have exactly 3 replicas,
so adding one more AZ does not significantly improve HDFS data availability as
a whole. Thanks for bring up erasure coding (EC, really good point. And in EC,
data+parity blocks are distributed evenly across the racks. The default policy
{{BlockPlacementPolicyRackFaultTolerant}} works perfectly fine for EC if we
don't need to tolerate higher level, call it *zone* (meaning "datacenter" in
traditional 1P data-center, or "AZ" in public cloud). But since the racks are
chosen randomly regardless zone distribution, it would not tolerate AZ failure
in public cloud.
The example you gave makes perfect sense to me. I think this new block
placement policy could also be used by EC as well. I think the implementation
may not need to make much assumption regarding replica vs. EC.
> A new block placement policy tolerating availability zone failure
> -----------------------------------------------------------------
>
> Key: HDFS-14786
> URL: https://issues.apache.org/jira/browse/HDFS-14786
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: block placement
> Reporter: Mingliang Liu
> Priority: Major
>
> {{NetworkTopology}} assumes "/datacenter/rack/host" 3 layer topology. Default
> block placement policies are rack awareness for better fault tolerance. Newer
> block placement policy like {{BlockPlacementPolicyRackFaultTolerant}} tries
> its best to place the replicas to most racks, which further tolerates more
> racks failing. HADOOP-8470 brought {{NetworkTopologyWithNodeGroup}} to add
> another layer under rack, i.e. "/datacenter/rack/host/nodegroup" 4 layer
> topology. With that, replicas within a rack can be placed in different node
> groups for better isolation.
> Existing block placement policies tolerate one rack failure since at least
> two racks are chosen in those cases. Chances are all replicas could be placed
> in the same datacenter, though there are multiple data centers in the same
> cluster topology. In other words, fault of higher layers beyond rack is not
> well tolerated.
> However, more deployments in public cloud are leveraging multiple available
> zones (AZ) for high-availability since the inter-AZ latency seems affordable
> in many cases. In a single AZ, some cloud providers like AWS support
> [partitioned placement
> groups|https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-partition]
> which basically are different racks. A simple network topology mapped to
> HDFS is "/availabilityzone/rack/host" 3 layers.
> To achieve high availability tolerating zone failure, this JIRA proposes a
> new data placement policy which tries its best to place replicas in most AZs,
> most racks, and most evenly distributed.
> Examples with 3 replicas, we choose racks as following:
> - 1AZ: fall back to {{BlockPlacementPolicyRackFaultTolerant}} to place among
> most racks
> - 2AZ: randomly choose one rack in one AZ and randomly choose two racks in
> the other AZ
> - 3AZ: randomly choose one rack in every AZ
> - 4AZ: randomly choose three AZs and randomly choose one rack in every AZ
> After racks are picked, hosts are chosen randomly within racks honoring local
> storage, favorite nodes, excluded nodes, storage types etc. Data may become
> imbalance if topology is very uneven in AZs. This seems not a problem as in
> public cloud, infrastructure provisioning is more flexible than 1P.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]