[
https://issues.apache.org/jira/browse/HDFS-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021165#comment-14021165
]
Vinayakumar B commented on HDFS-5442:
-------------------------------------
Hi Chris, thanks for taking a look at the design.
Yes, I agree that documentation is a very important task for this feature, and
I am sure that with all your support we can produce very good documentation for it.
{quote}IMO, the choice to activate the mirror must not be taken lightly.
Suppose there is a network partition such that all nodes in DC1 have
connectivity to each other, and all nodes in DC2 have connectivity to each
other, but there is no connectivity between DC1 and DC2. In this scenario, it's
possible that client applications are still generating edits inside DC1, but an
operator in DC2 won't be able to determine that. If an operator activates the
mirror in DC2, starts running client applications in DC2, and then connectivity
is restored between DC1 and DC2, then we have a split-brain scenario. Since
there is no reconciliation in the design, I believe the operator's only choice
at that point is to discard completely the instance in DC1 or the instance in
DC2. This is another area needing clear documentation, so that system
administrators fully understand the risk.{quote}
Yes, there will always be a chance of a split-brain scenario when the
inter-datacenter connection is broken. There are multiple things that need to
be taken care of before activating the mirror. Of course, this is manual (at
least for now):
1. Validate whether the primary DC is actually down, or whether only the
connection between the datacenters is broken (a rough operator-side probe is
sketched below).
2. If only the connection between the datacenters is broken, then in
synchronous mode all client operations will fail, because writing the edits
synchronously will fail. In asynchronous mode, operations will not fail, but
there will be a resync mechanism.
3. Since, as mentioned above, there is no chance of missing any edits, the
operator need not discard the complete instance of either DC.
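To make the first point above concrete, here is a minimal, illustrative operator-side probe (not part of the design or the attached patches) that checks whether the primary NameNode's RPC endpoint is reachable before the mirror is activated. The host name and port are placeholders, and a probe from DC2 alone cannot distinguish "primary down" from "inter-DC link broken", so an out-of-band check from inside DC1 is still required.
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/**
 * Illustrative sketch only: probe the primary DC's NameNode RPC endpoint
 * before activating the mirror. A failed probe from DC2 may mean the primary
 * is down, or merely that the inter-DC link is broken.
 */
public class PrimaryNameNodeProbe {
  public static boolean isReachable(String host, int port, int timeoutMs) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMs);
      return true;   // RPC port accepted a TCP connection
    } catch (IOException e) {
      return false;  // unreachable from this vantage point
    }
  }

  public static void main(String[] args) {
    // Placeholder address for the primary DC (DC1) NameNode
    boolean reachable = isReachable("nn1.dc1.example.com", 8020, 5000);
    System.out.println("Primary NameNode reachable from DC2: " + reachable);
  }
}
{code}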
{quote}The jira description states as a goal that a replicated copy of the
cluster should be running again in another DC "in a matter of minutes". Just as
a meta-observation, it appears that the process of activating the mirror in the
second DC would require significant reconfiguration of the mirror, a restart,
and then of course waiting on block reports to get out of safe mode. This makes
me wonder if a system administrator really can execute on this in a matter of
just minutes. I apologize if I'm misunderstanding. I'll start taking a look at
the patches on the sub-tasks, and perhaps that will shed some further
light.{quote}
Basically, as per the current design, only one configuration property
(dfs.region.primary) needs to be changed manually to activate the mirror,
followed by a cluster restart. Of course, we can improve this further if an
automatic failover mechanism is implemented later.
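Just as an illustration of that single switch, a minimal sketch is below. Only the property name dfs.region.primary comes from the current design; the region values ("dc1", "dc2") and reading them through HdfsConfiguration are my assumptions, since in practice the operator would simply edit hdfs-site.xml on the mirror cluster and restart.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

/**
 * Sketch of the single configuration change that promotes the mirror.
 * Region values "dc1"/"dc2" are hypothetical; only dfs.region.primary is
 * taken from the current design.
 */
public class MirrorActivationSketch {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration(); // loads hdfs-site.xml from the classpath
    System.out.println("Primary region before failover: "
        + conf.get("dfs.region.primary", "dc1"));

    // On failover the mirror DC is promoted by pointing the primary region at itself,
    // after which the cluster is restarted.
    conf.set("dfs.region.primary", "dc2");
    System.out.println("Primary region after failover: "
        + conf.get("dfs.region.primary"));
  }
}
{code}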
In asynchronous mode, if some blocks were missed during replication from the
primary, the missed blocks need to be deleted after activating and restarting
the cluster.
But in synchronous mode, all recent data will be available on both the
primary and mirror clusters; in this case the NameNode will only stay in
safemode for a short time due to the restart.
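As a rough sketch of that last step, and assuming fs.defaultFS on the operator's machine already points at the activated mirror cluster, waiting for the NameNode to leave safemode after the restart could look like the following (equivalent to running "hdfs dfsadmin -safemode wait"):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

/** Wait until the (mirror) NameNode has left safemode after the restart. */
public class WaitForSafeModeExit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // assumes fs.defaultFS points at the mirror
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) { // true while still in safemode
        System.out.println("NameNode still in safemode, waiting for block reports...");
        Thread.sleep(5000);
      }
      System.out.println("Safemode is off; the mirror cluster is ready to serve.");
    }
  }
}
{code}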
{quote}Have you considered the workflow for doing a rolling upgrade on a
paired primary and mirror? I suspect there are some challenges here around
coordinating the edit log roll and the edit log op codes to start or finalize a
rolling upgrade. It seems the mirror must have the new software version fully
deployed first, before new edits related to new features start flowing from the
primary.{quote}
I think rolling upgrade was not considered while making this design. As of
now, it is assumed that both clusters are running the same version of Hadoop.
Rolling of the edit logs on both sides has been considered carefully and
implemented. You can review the patch in one of the subtasks.
{quote}Is the DataNode aware of its region ID? If so, then the DataNode could
check the region ID of the NameNode after registration to prevent
misconfigurations, similar to what we do for block pool ID and cluster
ID.{quote}
As of now, a DataNode registers only with its own region's NameNode; there is
no cross-DC DataNode registration. The regionId is not carried via RPC; it is
just a configuration item used to identify the correct NameNode.
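Just to illustrate what "configuration-only" region identification means, here is a hypothetical sketch. The keys dfs.region.id and dfs.namenode.rpc-address.<region> are placeholders made up for illustration, not the actual keys used in the patches; the point is only that a DataNode resolves its own region's NameNode from local configuration and never sends the region id over RPC.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

/**
 * Hypothetical sketch: the region id is a purely local configuration item
 * used to pick the right NameNode address; it is never carried over RPC.
 * Key names below are placeholders, not the keys from the actual patches.
 */
public class RegionAwareNameNodeLookup {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();

    String regionId = conf.get("dfs.region.id", "dc1"); // placeholder key
    String nnAddress = conf.get("dfs.namenode.rpc-address." + regionId, // placeholder key
        "nn1." + regionId + ".example.com:8020");

    System.out.println("DataNode in region " + regionId
        + " registers with NameNode at " + nnAddress);
  }
}
{code}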
Hope I have answered all of your queries. Suggestions and queries are always
welcome. :)
> Zero loss HDFS data replication for multiple datacenters
> --------------------------------------------------------
>
> Key: HDFS-5442
> URL: https://issues.apache.org/jira/browse/HDFS-5442
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Avik Dey
> Assignee: Dian Fu
> Attachments: Disaster Recovery Solution for Hadoop.pdf, Disaster
> Recovery Solution for Hadoop.pdf
>
>
> Hadoop is architected to operate efficiently at scale for normal hardware
> failures within a datacenter. Hadoop is not designed today to handle
> datacenter failures. Although HDFS is not designed for nor deployed in
> configurations spanning multiple datacenters, replicating data from one
> location to another is common practice for disaster recovery and global
> service availability. There are current solutions available for batch
> replication using data copy/export tools. However, while providing some
> backup capability for HDFS data, they do not provide the capability to
> recover all your HDFS data from a datacenter failure and be up and running
> again with a fully operational Hadoop cluster in another datacenter in a
> matter of minutes. For disaster recovery from a datacenter failure, we should
> provide a fully distributed, zero data loss, low latency, high throughput and
> secure HDFS data replication solution for multiple datacenter setup.
> Design and code for Phase-1 to follow soon.
--
This message was sent by Atlassian JIRA
(v6.2#6252)