[ 
https://issues.apache.org/jira/browse/HDFS-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851209#comment-13851209
 ] 

Jerry Chen commented on HDFS-5442:
----------------------------------

{quote}There are two clusters in your design document: the primary cluster and 
the secondary cluster. I think we only need one cluster.{quote}
We think it is important to have a clear communication and collaboration boundary 
between the regions (datacenters), for the following reasons:

1. When one datacenter fails, another datacenter should take over with a 
symmetric HA cluster, rather than leaving behind a single cluster with reduced 
resources. 

2. With a single-cluster approach, the impact on the existing HDFS deployment 
concept is huge. An HDFS cluster would no longer be one Active NameNode plus one 
Standby NameNode; it would span multiple “regions”, with “two Standby NameNodes 
in each region”. The DataNodes would also be split into regions, and block 
locations would not be shared between NameNodes of different regions, even 
though they all belong to a single HDFS cluster. These conceptual changes would 
further impact existing Hadoop operations and tooling. (The configuration sketch 
after this list shows the conventional single-site HA layout that the two-cluster 
design keeps on each side.)

3. With a single-cluster approach, operations must manage all the nodes across 
datacenters as one unit. This may cause unnecessary cross-site communication, 
and it loses the flexibility of managing each datacenter separately. 

4. With two clusters we also avoid larger questions, such as how upper-level 
components like HBase and YARN could be deployed and run over a single HDFS 
system spanning multiple sites.
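
For reference, here is a minimal sketch of the conventional one-Active/one-Standby 
HA layout that each cluster keeps in the two-cluster design. The nameservice id 
"dc1" and the hostnames are made-up placeholders; the backup datacenter would run 
its own symmetric cluster (e.g. a separate "dc2" nameservice) rather than 
stretching one nameservice across regions.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: the standard single-datacenter HA layout, one Active and one
// Standby NameNode behind a single nameservice. Hostnames are placeholders.
public class SingleSiteHaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "dc1");
        conf.set("dfs.ha.namenodes.dc1", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.dc1.nn1", "dc1-nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.dc1.nn2", "dc1-nn2.example.com:8020");
        // The backup datacenter would carry an identical, independent layout
        // under its own nameservice id (e.g. "dc2"), so failover means
        // switching clusters, not re-shaping one cluster across regions.
        System.out.println(conf.get("dfs.ha.namenodes.dc1"));
    }
}
{code}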

As for QJM: although it can in theory span datacenters, the JournalNodes in the 
backup datacenter could easily be out of date when the primary fails. This is 
because a write only needs to be acknowledged by a majority of the journals, 
with no consideration of their location. 
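To make this concrete, below is a toy simulation of a location-unaware majority 
quorum. It is a sketch only, not Hadoop's actual QJM code; the node names and 
the two-plus-one placement are assumptions. Every edit commits with 
acknowledgements from the two local journals alone, so the single copy in the 
backup datacenter can fall arbitrarily far behind before the primary site is 
lost.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QuorumLagSketch {
    public static void main(String[] args) {
        // Three journals: two in the primary datacenter (dc1), one in the
        // backup datacenter (dc2). Names are hypothetical.
        List<String> journals = List.of("dc1-jn1", "dc1-jn2", "dc2-jn1");
        int quorum = journals.size() / 2 + 1; // majority = 2 of 3

        // Highest transaction id each journal has acknowledged.
        Map<String, Long> lastAckedTxId = new HashMap<>();
        for (String j : journals) {
            lastAckedTxId.put(j, 0L);
        }

        // Simulate 100 edits where only the two low-latency local journals
        // respond in time; the remote journal never acknowledges.
        for (long txId = 1; txId <= 100; txId++) {
            int acks = 0;
            for (String j : journals) {
                boolean responded = j.startsWith("dc1-"); // remote journal too slow
                if (responded) {
                    lastAckedTxId.put(j, txId);
                    acks++;
                }
            }
            if (acks < quorum) {
                throw new IllegalStateException("write failed at txId " + txId);
            }
            // 2 acks >= quorum, so the writer treats the edit as durable.
        }

        // All 100 edits committed, yet the only copy outside dc1 saw none of them.
        System.out.println("committed through txId 100");
        System.out.println("dc2-jn1 last acked txId: " + lastAckedTxId.get("dc2-jn1"));
    }
}
{code}

Requiring an acknowledgement from the remote site would close this gap, but it 
would also put the cross-datacenter round trip on every commit, which is the 
latency cost the quorum majority is meant to avoid.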


> Zero loss HDFS data replication for multiple datacenters
> --------------------------------------------------------
>
>                 Key: HDFS-5442
>                 URL: https://issues.apache.org/jira/browse/HDFS-5442
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Avik Dey
>         Attachments: Disaster Recovery Solution for Hadoop.pdf
>
>
> Hadoop is architected to operate efficiently at scale for normal hardware 
> failures within a datacenter. Hadoop is not designed today to handle 
> datacenter failures. Although HDFS is neither designed for nor deployed in 
> configurations spanning multiple datacenters, replicating data from one 
> location to another is common practice for disaster recovery and global 
> service availability. Current solutions exist for batch replication using 
> data copy/export tools. However, while they provide some backup capability 
> for HDFS data, they do not provide the ability to recover all of your HDFS 
> data after a datacenter failure and be up and running again with a fully 
> operational Hadoop cluster in another datacenter within minutes. For 
> disaster recovery from a datacenter failure, we should provide a fully 
> distributed, zero-data-loss, low-latency, high-throughput and secure HDFS 
> data replication solution for multiple-datacenter setups.
> Design and code for Phase-1 to follow soon.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
