[ https://issues.apache.org/jira/browse/HDFS-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020570#comment-14020570 ]

Chris Nauroth commented on HDFS-5442:
-------------------------------------

Thank you, everyone, for working on this feature.  I've caught up on the design 
document, and I have a few questions and observations:

Since failover from primary to mirror is manual, we have a responsibility to 
document usage very carefully for system administrators.  A few of the 
significant workflows that come to mind are:
# Configure and bootstrap a mirror for an existing deployed cluster.  This 
process is already described at a high level in the design doc.
# Activate the mirror so that it can serve reads and writes.  This is described 
as a manual process, and I can see how it could be achieved by changing various 
configurations and restarting the mirror, but an operator will need clear docs 
(see the sketch after this list for the kind of check I have in mind).
# Switch back to the original primary.  This is a second failover, performed 
after the original data center comes back online, and once again I can see how 
a reconfigure-and-restart procedure would work at a high level.

IMO, the choice to activate the mirror must not be taken lightly.  Suppose 
there is a network partition such that all nodes in DC1 have connectivity to 
each other, and all nodes in DC2 have connectivity to each other, but there is 
no connectivity between DC1 and DC2.  In this scenario, it's possible that 
client applications are still generating edits inside DC1, but an operator in 
DC2 won't be able to determine that.  If an operator activates the mirror in 
DC2, starts running client applications in DC2, and then connectivity is 
restored between DC1 and DC2, then we have a split-brain scenario.  Since there 
is no reconciliation in the design, I believe the operator's only choice at 
that point is to completely discard either the DC1 instance or the DC2 
instance.  This is another area needing clear documentation, so that system 
administrators fully understand the risk.
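
To make the failure mode concrete, here is a toy illustration (no HDFS APIs, 
just transaction IDs): once both sides have applied edits past the last txid 
that was replicated before the partition, the two edit histories have forked 
and one of them has to be thrown away.

{code:java}
/**
 * Toy illustration of the split-brain case described above.  Once both DC1
 * and DC2 have applied edits past the last transaction the mirror had
 * acknowledged before the partition, the namespaces have diverged and one
 * side must be discarded.  Purely illustrative; no HDFS APIs involved.
 */
public class DivergenceCheck {

  static boolean hasDiverged(long lastCommonTxId, long dc1LastTxId, long dc2LastTxId) {
    // Both sides moved past the common point => two independent edit histories.
    return dc1LastTxId > lastCommonTxId && dc2LastTxId > lastCommonTxId;
  }

  public static void main(String[] args) {
    long lastCommonTxId = 1000;  // last txid replicated to DC2 before the partition
    long dc1LastTxId = 1500;     // DC1 kept taking client edits during the partition
    long dc2LastTxId = 1200;     // DC2 was activated and also took edits

    if (hasDiverged(lastCommonTxId, dc1LastTxId, dc2LastTxId)) {
      System.out.println("Split brain: edits exist on both sides past txid "
          + lastCommonTxId + "; one instance must be discarded.");
    }
  }
}
{code}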

The JIRA description states as a goal that a replicated copy of the cluster 
should be running again in another DC "in a matter of minutes".  Just as a 
meta-observation, it appears that activating the mirror in the second DC would 
require significant reconfiguration of the mirror, a restart, and then of 
course waiting on block reports so the NameNode can leave safe mode.  This 
makes me wonder whether a system administrator can really execute all of that 
in a matter of just minutes.  I apologize if I'm misunderstanding.  I'll start 
taking a look at the patches on the sub-tasks, and perhaps that will shed some 
further light.
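
For what it's worth, the safe mode portion of that wait is at least easy to 
observe with the existing DistributedFileSystem API; something like the 
following (nothing specific to this feature):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

/**
 * Polls the (re)activated NameNode until it leaves safe mode, i.e. until
 * enough block reports have arrived.  Uses only the existing
 * DistributedFileSystem API.
 */
public class WaitForSafeModeExit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Assumes fs.defaultFS points at the activated mirror NameNode.
    try (DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(conf)) {
      while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) {
        System.out.println("NameNode still in safe mode, waiting...");
        Thread.sleep(10000L);
      }
      System.out.println("NameNode is out of safe mode.");
    }
  }
}
{code}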

Have you considered the workflow for doing a rolling upgrade on a paired 
primary and mirror?  I suspect there are some challenges here around 
coordinating the edit log roll and the edit log op codes to start or finalize a 
rolling upgrade.  It seems the mirror must have the new software version fully 
deployed first, before new edits related to new features start flowing from the 
primary.
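
One way to make that ordering explicit would be a guard on the mirror's 
edit-apply path, roughly along these lines.  The class and check are 
hypothetical, not taken from the patches:

{code:java}
import java.io.IOException;

/**
 * Hypothetical guard for the mirror's edit-apply path: refuse to apply edits
 * written by a newer NameNode layout version than this mirror's software
 * understands.  Illustrates the "upgrade the mirror first" ordering; not
 * code from the patches.
 */
public class EditCompatibilityGuard {

  // Layout version this mirror's software supports.
  private final int supportedLayoutVersion;

  public EditCompatibilityGuard(int supportedLayoutVersion) {
    this.supportedLayoutVersion = supportedLayoutVersion;
  }

  /** Throws if the incoming edit segment was written by newer software. */
  public void checkSegment(int segmentLayoutVersion) throws IOException {
    // HDFS layout versions decrease as features are added, so "newer" means
    // a smaller (more negative) value than what this mirror supports.
    if (segmentLayoutVersion < supportedLayoutVersion) {
      throw new IOException("Edit segment layout version " + segmentLayoutVersion
          + " is newer than this mirror supports (" + supportedLayoutVersion
          + "); upgrade the mirror before rolling the primary forward.");
    }
  }
}
{code}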

Is the DataNode aware of its region ID?  If so, then the DataNode could check 
the region ID of the NameNode after registration to prevent misconfigurations, 
similar to what we do for block pool ID and cluster ID.
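
Concretely, I'm imagining something modeled on the existing block pool ID / 
cluster ID verification in the DataNode handshake, with a region ID added.  
The region ID field is hypothetical, since today's handshake info does not 
carry it:

{code:java}
import java.io.IOException;

/**
 * Sketch of the registration-time check suggested above, modeled on the
 * existing block pool ID / cluster ID verification in the DataNode handshake.
 * The regionId field is hypothetical; today's handshake info does not carry it.
 */
public class RegionIdCheck {

  /** Minimal stand-in for the handshake response from the NameNode. */
  static class HandshakeInfo {
    final String clusterId;
    final String regionId;   // hypothetical new field
    HandshakeInfo(String clusterId, String regionId) {
      this.clusterId = clusterId;
      this.regionId = regionId;
    }
  }

  static void verifyRegion(String dnRegionId, HandshakeInfo nnInfo)
      throws IOException {
    if (!dnRegionId.equals(nnInfo.regionId)) {
      // Fail fast, just as we do today on a cluster ID mismatch.
      throw new IOException("DataNode is configured for region " + dnRegionId
          + " but NameNode reports region " + nnInfo.regionId
          + "; refusing to register.");
    }
  }

  public static void main(String[] args) throws IOException {
    verifyRegion("DC1", new HandshakeInfo("CID-1234", "DC1"));  // ok
    verifyRegion("DC2", new HandshakeInfo("CID-1234", "DC1"));  // throws
  }
}
{code}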


> Zero loss HDFS data replication for multiple datacenters
> --------------------------------------------------------
>
>                 Key: HDFS-5442
>                 URL: https://issues.apache.org/jira/browse/HDFS-5442
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Avik Dey
>            Assignee: Dian Fu
>         Attachments: Disaster Recovery Solution for Hadoop.pdf, Disaster 
> Recovery Solution for Hadoop.pdf
>
>
> Hadoop is architected to operate efficiently at scale for normal hardware 
> failures within a datacenter. Hadoop is not designed today to handle 
> datacenter failures. Although HDFS is not designed for nor deployed in 
> configurations spanning multiple datacenters, replicating data from one 
> location to another is common practice for disaster recovery and global 
> service availability. There are current solutions available for batch 
> replication using data copy/export tools. However, while providing some 
> backup capability for HDFS data, they do not provide the capability to 
> recover all your HDFS data from a datacenter failure and be up and running 
> again with a fully operational Hadoop cluster in another datacenter in a 
> matter of minutes. For disaster recovery from a datacenter failure, we should 
> provide a fully distributed, zero data loss, low latency, high throughput and 
> secure HDFS data replication solution for multi-datacenter setups.
> Design and code for Phase-1 to follow soon.


