[ 
https://issues.apache.org/jira/browse/HDFS-14786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919947#comment-16919947
 ] 

Mingliang Liu commented on HDFS-14786:
--------------------------------------

[[email protected]] This is new to me. Glad to know that. Thanks!

Better isolation using replication zones (RZ) could reduce the blast radius 
without sacrificing availability. Say there are three replicas, placed in 2 
racks with different power supplies and network switches. If a rack is lost, 
the other racks in the same RZ will be a little busier. This can be considered 
better than disturbing the whole cluster/DC, if quality variance (which could 
be large) is not acceptable. This seems more of a problem for clusters at 
scale.

Our use case and goal are different. In public cloud, we don't have a huge 
cluster or infrastructure. Instead, we have more, smaller clusters. To provide 
high availability, each cluster will span 3 AZs. If a rack or zone is lost, I 
would prefer a more balanced load when re-replicating data rather than waiting 
for a few nodes which could be too busy to serve. As to the implementation, I 
think we can borrow the idea of reducing blast radius. For example, if a rack 
is lost, the block placement group would hopefully favor another rack in the 
same zone. This way, data replication would be mostly intra-AZ, with less 
disturbance and monetary cost than inter-AZ. Anyway, the basic idea still 
applies here: block placement policy should take high-level topology 
information into consideration.
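
To make the "favor another rack in the same zone" idea concrete, here is a 
minimal sketch. It is not existing HDFS code: the class 
ReplacementRackPicker, its methods, and the "/az/rack" path format are all 
hypothetical names chosen for illustration. It only shows the preference 
order: same-AZ racks first, other AZs as a fallback.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.Set;

/** Hypothetical sketch; not part of HDFS. Rack paths look like "/az/rack". */
final class ReplacementRackPicker {
  /** Returns the AZ component of a "/az/rack" path. */
  static String azOf(String rackPath) {
    return rackPath.split("/")[1];
  }

  /**
   * Picks a rack to re-replicate to after {@code lostRack} fails, preferring
   * a rack in the same AZ so that the copy traffic stays intra-AZ.
   */
  static Optional<String> pick(String lostRack, Collection<String> liveRacks,
                               Set<String> racksWithReplica, Random rnd) {
    List<String> sameAz = new ArrayList<>();
    List<String> otherAz = new ArrayList<>();
    for (String rack : liveRacks) {
      if (rack.equals(lostRack) || racksWithReplica.contains(rack)) {
        continue;  // skip the failed rack and racks already holding a replica
      }
      (azOf(rack).equals(azOf(lostRack)) ? sameAz : otherAz).add(rack);
    }
    // Prefer intra-AZ candidates; cross the AZ boundary only if none exist.
    List<String> pool = sameAz.isEmpty() ? otherAz : sameAz;
    return pool.isEmpty()
        ? Optional.empty()
        : Optional.of(pool.get(rnd.nextInt(pool.size())));
  }
}
```

A real policy would also have to weigh node load, storage types and excluded 
nodes, which this sketch ignores.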

Also, is there any code, talk or doc about the "replication zone" from Dhruba? 
I believe I would find them helpful.

(when I say "we", I mean the reader and I, me, myself. I don't speak on behalf 
of my employer)

> A new block placement policy tolerating availability zone failure
> -----------------------------------------------------------------
>
>                 Key: HDFS-14786
>                 URL: https://issues.apache.org/jira/browse/HDFS-14786
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: block placement
>            Reporter: Mingliang Liu
>            Priority: Major
>
> {{NetworkTopology}} assumes a 3-layer "/datacenter/rack/host" topology. Default 
> block placement policies are rack aware for better fault tolerance. Newer 
> block placement policies like {{BlockPlacementPolicyRackFaultTolerant}} try 
> their best to spread replicas across as many racks as possible, which tolerates 
> more racks failing. HADOOP-8470 brought {{NetworkTopologyWithNodeGroup}} to add 
> another layer under rack, i.e. "/datacenter/rack/nodegroup/host" 4 layer 
> topology. With that, replicas within a rack can be placed in different node 
> groups for better isolation.
> Existing block placement policies tolerate one rack failure since at least 
> two racks are chosen in those cases. Chances are all replicas could be placed 
> in the same datacenter, though there are multiple data centers in the same 
> cluster topology. In other words, faults at layers above the rack are not 
> well tolerated.
> However, more deployments in public cloud are leveraging multiple availability 
> zones (AZs) for high availability since the inter-AZ latency seems affordable 
> in many cases. In a single AZ, some cloud providers like AWS support 
> [partitioned placement 
> groups|https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-partition]
>  which basically are different racks. A simple network topology mapped to 
> HDFS is "/availabilityzone/rack/host" 3 layers.
> To achieve high availability tolerating zone failure, this JIRA proposes a 
> new data placement policy which tries its best to spread replicas across as 
> many AZs and racks as possible, as evenly as possible.
> For example, with 3 replicas, we choose racks as follows:
>  - 1AZ: fall back to {{BlockPlacementPolicyRackFaultTolerant}} to place among 
> most racks
>  - 2AZ: randomly choose one rack in one AZ and randomly choose two racks in 
> the other AZ
>  - 3AZ: randomly choose one rack in every AZ
>  - 4AZ: randomly choose three AZs and randomly choose one rack in every AZ
> After racks are picked, hosts are chosen randomly within racks honoring local 
> storage, favorite nodes, excluded nodes, storage types etc. Data may become 
> imbalanced if the topology is very uneven across AZs. This does not seem to be 
> a problem, as in public cloud infrastructure provisioning is more flexible than 1P.
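
The rack selection quoted above (1AZ through 4AZ with 3 replicas) can be 
sketched as a round-robin over randomly ordered AZs. This is only an 
illustration under stated assumptions, not the patch for this JIRA: the class 
AzRackChooser, the method chooseRacks, and the map-based topology input are 
hypothetical, and the sketch returns fewer racks than replicas when the 
topology has too few distinct racks instead of reusing a rack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch; not part of HDFS. */
final class AzRackChooser {
  /**
   * Picks up to {@code replicas} distinct racks, visiting AZs round-robin in
   * random order so that as many AZs as possible are covered first.
   * Returned paths have the form "/az/rack".
   */
  static List<String> chooseRacks(Map<String, List<String>> racksByAz,
                                  int replicas, Random rnd) {
    List<String> azs = new ArrayList<>(racksByAz.keySet());
    Collections.sort(azs);          // fixed starting order for any map type
    Collections.shuffle(azs, rnd);  // random AZ order
    // Shuffle a copy of each AZ's racks so picks within an AZ are random.
    List<List<String>> pools = new ArrayList<>();
    for (String az : azs) {
      List<String> racks = new ArrayList<>(racksByAz.get(az));
      Collections.shuffle(racks, rnd);
      pools.add(racks);
    }
    List<String> chosen = new ArrayList<>();
    // Round r takes the r-th rack of every AZ that still has one.
    for (int round = 0; chosen.size() < replicas; round++) {
      boolean progressed = false;
      for (int i = 0; i < azs.size() && chosen.size() < replicas; i++) {
        if (round < pools.get(i).size()) {
          chosen.add("/" + azs.get(i) + "/" + pools.get(i).get(round));
          progressed = true;
        }
      }
      if (!progressed) {
        break;  // fewer distinct racks than replicas
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    Map<String, List<String>> topo = new HashMap<>();
    topo.put("az1", Arrays.asList("r1", "r2"));
    topo.put("az2", Arrays.asList("r3", "r4", "r5"));
    // 2AZ case: one rack from one AZ and two racks from the other.
    System.out.println(chooseRacks(topo, 3, new Random()));
  }
}
```

With 3 or more AZs the first round already yields one rack per AZ, so the 3AZ 
and 4AZ cases fall out of the same loop. Host selection within each rack is 
left out.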



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
