[ https://issues.apache.org/jira/browse/HDFS-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246862#comment-14246862 ]

Hari Sekhon commented on HDFS-5442:
-----------------------------------

Zero loss is practically impossible unless you do synchronous, high-latency 
writes to both sites, so neither WANdisco nor MapR can claim zero loss while 
still performing well, and I've had significant, unmanageable streaming async 
block replication lag of more than several dozen minutes (i.e. significant 
potential data loss) when using a well-known proprietary HDFS add-on...

With an atomic snapshot mirroring mode you will at least know that what you have 
is consistent to a known point in time and can work with that, rather than having 
to fsck to find out which data has random holes in it from blocks that haven't 
made it across the low-priority async replication.

For option 1 it would be better if block write ordering could be maintained and 
replayed at the other site in the same order, for chronological consistency up to 
the latest DR checkpoint, in case any non-trivial application sitting on top of 
the filesystem isn't prepared for holes in its data, e.g. write-ahead logs or the 
redo logs of distributed SQL databases sitting on top of HDFS (some solutions 
might do their own replication, in which case they should be excluded via my 
previously mentioned configurable path exclusions).
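
Purely to illustrate the ordering guarantee I mean (none of these types exist in 
HDFS, the record type, checkpoint marker and replay loop below are hypothetical):

{code:java}
// Sketch only: replay writes at the DR site strictly in the order they happened
// at the primary, and never past the latest DR checkpoint.
import java.util.TreeMap;

public class OrderedReplay {
    /** A captured write from the primary site, tagged with a global sequence number. */
    record WriteRecord(long seqNo, String path, long offset, byte[] data) {}

    private final TreeMap<Long, WriteRecord> pending = new TreeMap<>();
    private long nextSeqNo = 1;     // next record that may be applied at the DR site
    private long drCheckpoint = 0;  // highest seqNo the primary has declared durable

    /** Records can arrive out of order over the WAN; buffer them by sequence number. */
    public synchronized void onReceive(WriteRecord r) {
        pending.put(r.seqNo(), r);
        drain();
    }

    /** The primary periodically advances the DR checkpoint. */
    public synchronized void onCheckpoint(long seqNo) {
        drCheckpoint = Math.max(drCheckpoint, seqNo);
        drain();
    }

    /** Apply writes strictly in sequence order and never beyond the latest checkpoint,
     *  so the DR copy is always a chronologically consistent prefix of the primary. */
    private void drain() {
        WriteRecord r;
        while ((r = pending.get(nextSeqNo)) != null && nextSeqNo <= drCheckpoint) {
            apply(r);
            pending.remove(nextSeqNo);
            nextSeqNo++;
        }
    }

    private void apply(WriteRecord r) {
        // A real implementation would write the block/edit into the DR cluster here.
        System.out.printf("applied seq=%d path=%s offset=%d len=%d%n",
                r.seqNo(), r.path(), r.offset(), r.data().length);
    }
}
{code}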

The final thing HDFS DR should have is an administrative, active foreground 
block-repair mode for off-peak hours, to catch up faster by maxing out the 
bandwidth (or whatever maximum bandwidth settings you've specified).
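
As a rough analogue using knobs that exist today (the per-datanode balancer 
bandwidth cap plus DistCp's per-map -bandwidth throttle), something like the 
sketch below; the window times, hosts and numbers are made up.

{code:java}
// Sketch only: off-peak catch-up by lifting the bandwidth caps, using today's
// knobs as analogues. Window times, hosts and numbers are illustrative.
import java.time.LocalTime;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

public class OffPeakCatchUp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path root = new Path("hdfs://primary-nn:8020/");       // assumed primary cluster
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(root.toUri(), conf);

        LocalTime now = LocalTime.now();
        boolean offPeak = now.isAfter(LocalTime.of(22, 0)) || now.isBefore(LocalTime.of(6, 0));

        // Lift the per-datanode bandwidth cap during the off-peak window
        // (same effect as: hdfs dfsadmin -setBalancerBandwidth <bytes/sec>).
        long capBytesPerSec = offPeak ? 1_000_000_000L : 50_000_000L;
        dfs.setBalancerBandwidth(capBytesPerSec);

        // Run the catch-up copy with a matching per-map throttle
        // (DistCp's -bandwidth is MB per second per map).
        String perMapMB = offPeak ? "400" : "20";
        String[] distcpArgs = {
            "-update", "-bandwidth", perMapMB,
            "hdfs://primary-nn:8020/data",                     // assumed source dir
            "hdfs://dr-nn:8020/data"                           // assumed DR target dir
        };
        System.exit(ToolRunner.run(conf, new DistCp(conf, null), distcpArgs));
    }
}
{code}

The real feature would obviously need its own bandwidth setting for the repair 
traffic rather than piggybacking on the balancer cap, but the idea is the same: 
open the throttle off-peak, clamp it back down during business hours.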

Ultimately both option 1 and option 2 should be provided, since each is better 
for different use cases. Option 2 has been done very well by MapR; option 1 
hasn't been done perfectly by anyone I've seen yet, but I'm very eager for this 
to be done (anyone at Hortonworks reading this??? ;)  ).

> Zero loss HDFS data replication for multiple datacenters
> --------------------------------------------------------
>
>                 Key: HDFS-5442
>                 URL: https://issues.apache.org/jira/browse/HDFS-5442
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Avik Dey
>            Assignee: Dian Fu
>         Attachments: Disaster Recovery Solution for Hadoop.pdf, Disaster 
> Recovery Solution for Hadoop.pdf, Disaster Recovery Solution for Hadoop.pdf
>
>
> Hadoop is architected to operate efficiently at scale for normal hardware 
> failures within a datacenter. Hadoop is not designed today to handle 
> datacenter failures. Although HDFS is not designed for nor deployed in 
> configurations spanning multiple datacenters, replicating data from one 
> location to another is common practice for disaster recovery and global 
> service availability. There are current solutions available for batch 
> replication using data copy/export tools. However, while providing some 
> backup capability for HDFS data, they do not provide the capability to 
> recover all your HDFS data from a datacenter failure and be up and running 
> again with a fully operational Hadoop cluster in another datacenter in a 
> matter of minutes. For disaster recovery from a datacenter failure, we should 
> provide a fully distributed, zero-data-loss, low-latency, high-throughput and 
> secure HDFS data replication solution for a multiple-datacenter setup.
> Design and code for Phase-1 to follow soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
