[
https://issues.apache.org/jira/browse/HDFS-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857276#comment-13857276
]
Jerry Chen commented on HDFS-5442:
----------------------------------
Thanks, Liu Lei.
{quote}1. If I want to store three replicas in the secondary cluster with
synchronous data writing, does the primary cluster need to maintain the
NetworkTopology of the datanodes in the secondary cluster?{quote}
Good question. With synchronous data writing, to minimize write latency, a
single replica can be written to the secondary cluster, with the secondary
cluster's own replication then extending the block to other data nodes in that
cluster. However, the number of data nodes (replicas) included in the
synchronous write pipeline is configurable. If more than one replica needs to
be written as part of the synchronous write pipeline, then the NetworkTopology
information needs to be available in the primary cluster.
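To make the trade-off concrete, here is a minimal sketch of how a configurable
replica count for the synchronous pipeline could drive the topology
requirement. The class name, the setting, and the "take the first N nodes"
policy are all hypothetical and are not taken from the design document or from
HDFS code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner for the mirror part of a synchronous write pipeline. */
final class MirrorPipelinePlanner {
    // Hypothetical setting: mirror replicas written synchronously.
    private final int syncMirrorReplicas;

    MirrorPipelinePlanner(int syncMirrorReplicas) {
        if (syncMirrorReplicas < 1) {
            throw new IllegalArgumentException("need at least one mirror replica");
        }
        this.syncMirrorReplicas = syncMirrorReplicas;
    }

    /** With a single mirror replica, no placement decision spans mirror racks. */
    boolean needsMirrorTopology() {
        return syncMirrorReplicas > 1;
    }

    /**
     * Picks the mirror data nodes for the synchronous pipeline. Remaining
     * replicas are left to the secondary cluster's own replication.
     * Placeholder policy: take the first N available nodes.
     */
    List<String> chooseSyncTargets(List<String> availableMirrorNodes) {
        int n = Math.min(syncMirrorReplicas, availableMirrorNodes.size());
        return new ArrayList<>(availableMirrorNodes.subList(0, n));
    }
}
```

With a value of 1, the primary only needs to know which mirror data nodes are
available; anything larger implies a rack-aware choice and therefore topology.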
We are also considering other possibilities that avoid sharing data node
availability and NetworkTopology information between the secondary (mirror)
cluster and the primary cluster. For example, the primary cluster can make an
RPC call to the mirror's active NameNode to choose the targets for every new
block allocation. The disadvantage of this approach is that it adds latency to
every new block allocation.
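The two alternatives can be sketched side by side as follows. The interface
and class names are hypothetical, and the RPC is stood in for by a plain
function so that the per-block round trip is visible; this is not the proposed
protocol itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical abstraction over "who picks the mirror targets for a block". */
interface TargetChooser {
    List<String> chooseTargets(String blockId);
}

/** Option A: the primary holds shared mirror node information and chooses locally. */
final class SharedTopologyChooser implements TargetChooser {
    private final List<String> knownMirrorNodes;

    SharedTopologyChooser(List<String> knownMirrorNodes) {
        this.knownMirrorNodes = new ArrayList<>(knownMirrorNodes);
    }

    @Override
    public List<String> chooseTargets(String blockId) {
        // Placeholder policy: first known node, no network round trip.
        int n = Math.min(1, knownMirrorNodes.size());
        return new ArrayList<>(knownMirrorNodes.subList(0, n));
    }
}

/** Option B: ask the mirror's active NameNode for every new block. */
final class RemoteNameNodeChooser implements TargetChooser {
    // Stand-in for an RPC to the mirror's active NameNode (assumption).
    private final Function<String, List<String>> mirrorNameNodeRpc;
    private int rpcCalls;

    RemoteNameNodeChooser(Function<String, List<String>> mirrorNameNodeRpc) {
        this.mirrorNameNodeRpc = mirrorNameNodeRpc;
    }

    @Override
    public List<String> chooseTargets(String blockId) {
        rpcCalls++; // one cross-datacenter round trip per block allocation
        return mirrorNameNodeRpc.apply(blockId);
    }

    int rpcCalls() {
        return rpcCalls;
    }
}
```

Option B removes the need to share state between clusters, at the cost of one
cross-datacenter call on the block allocation path.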
{quote}2. If the active namenode of the secondary cluster has not received a
heartbeat message from a datanode for more than 30s, the datanode is marked
and treated as "stale" by default. Data will not be written to these stale
datanodes. When datanodes in the secondary cluster become "stale", does the
active namenode of the secondary cluster send a DR_DN_AVAILABLE command to the
namenodes in the primary cluster? And when a "stale" datanode becomes live
again, does the active namenode of the secondary cluster send a
DR_DN_AVAILABLE command to the namenodes in the primary cluster?{quote}
Yes. The basic concept is that the active NameNode of the secondary cluster
shares these changes with the primary cluster via heartbeats, so that the
primary cluster's active NameNode learns of stale data nodes and newly
available data nodes.
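As a rough illustration of that idea, the sketch below tracks data node
heartbeats on the mirror side and collects only the status changes that would
be piggybacked on the next heartbeat to the primary. The class, the method
names, and the "STALE"/"AVAILABLE" labels are hypothetical; the 30-second
interval is the figure from the question above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical mirror-side tracker of data node status changes. */
final class MirrorDatanodeTracker {
    // Stale threshold, taken from the 30s figure in the question.
    static final long STALE_INTERVAL_MS = 30_000L;

    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
    private final Map<String, Boolean> reportedStale = new HashMap<>();

    /** Records a heartbeat received from a mirror data node. */
    void onDatanodeHeartbeat(String datanode, long nowMs) {
        lastHeartbeatMs.put(datanode, nowMs);
    }

    /**
     * Returns the status changes since the previous call, for example
     * "dn1:STALE" or "dn1:AVAILABLE". Unchanged nodes are not repeated.
     */
    List<String> collectChanges(long nowMs) {
        List<String> changes = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            boolean stale = nowMs - e.getValue() > STALE_INTERVAL_MS;
            Boolean previous = reportedStale.put(e.getKey(), stale);
            if (previous == null || previous != stale) {
                changes.add(e.getKey() + (stale ? ":STALE" : ":AVAILABLE"));
            }
        }
        return changes;
    }
}
```

Sending only deltas keeps the cross-datacenter heartbeat small; a real design
would also need a full resynchronization path after a lost heartbeat.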
{quote}3. When the secondary cluster becomes the primary cluster, can the
client automatically switch to the secondary cluster?{quote}
As this is a cross-datacenter recovery mechanism, the current design switches
the primary and secondary cluster roles manually, whereas HA Active and Standby
NameNode switching can be more automatic because both are in the same
datacenter. Purely technically, switching between the primary and secondary
clusters could be automated in a way similar to current HA. However, since a
cross-datacenter network is far less stable than an internal network, and
given other factors, automatically switching the primary and secondary cluster
roles would be riskier and could have unexpected results. The same applies on
the client side. Logically and technically, a new FailoverProxyProvider could
provide failover proxies between the primary and secondary NameNodes. So the
question is: "Is automatic failover what you expect in a cross-datacenter
situation?"
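The client-side choice can be pictured with the small sketch below. It is a
simplified stand-in written for illustration and deliberately does not
implement Hadoop's FailoverProxyProvider interface; the class, its fields, and
the addresses in the usage note are hypothetical.

```java
/** Hypothetical client-side policy for choosing between two clusters. */
final class CrossDcFailoverPolicy {
    private final String primaryAddress;
    private final String secondaryAddress;
    private final boolean automatic; // false models the manual switch described above
    private boolean usingSecondary;

    CrossDcFailoverPolicy(String primaryAddress, String secondaryAddress, boolean automatic) {
        this.primaryAddress = primaryAddress;
        this.secondaryAddress = secondaryAddress;
        this.automatic = automatic;
    }

    /** Address the client should currently talk to. */
    String current() {
        return usingSecondary ? secondaryAddress : primaryAddress;
    }

    /**
     * Called when a request to the current cluster fails.
     * Switches clusters only in automatic mode; returns whether it switched.
     */
    boolean onFailure() {
        if (!automatic) {
            return false; // a transient WAN error must not trigger a role change
        }
        usingSecondary = !usingSecondary;
        return true;
    }

    /** Operator-driven switch, used after the cluster roles were changed manually. */
    void manualSwitch() {
        usingSecondary = !usingSecondary;
    }
}
```

For example, `new CrossDcFailoverPolicy("nn.dc1:8020", "nn.dc2:8020", false)`
keeps returning the first address after `onFailure()` until `manualSwitch()`
is called, which matches the manual approach; passing `true` gives the
HA-like automatic behaviour in question.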
> Zero loss HDFS data replication for multiple datacenters
> --------------------------------------------------------
>
> Key: HDFS-5442
> URL: https://issues.apache.org/jira/browse/HDFS-5442
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Avik Dey
> Attachments: Disaster Recovery Solution for Hadoop.pdf, Disaster
> Recovery Solution for Hadoop.pdf
>
>
> Hadoop is architected to operate efficiently at scale for normal hardware
> failures within a datacenter. Hadoop is not designed today to handle
> datacenter failures. Although HDFS is not designed for nor deployed in
> configurations spanning multiple datacenters, replicating data from one
> location to another is common practice for disaster recovery and global
> service availability. There are current solutions available for batch
> replication using data copy/export tools. However, while providing some
> backup capability for HDFS data, they do not provide the capability to
> recover all your HDFS data from a datacenter failure and be up and running
> again with a fully operational Hadoop cluster in another datacenter in a
> matter of minutes. For disaster recovery from a datacenter failure, we should
> provide a fully distributed, zero data loss, low latency, high throughput and
> secure HDFS data replication solution for multiple datacenter setup.
> Design and code for Phase-1 to follow soon.