[
https://issues.apache.org/jira/browse/HDFS-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746060#comment-14746060
]
Chris Nauroth commented on HDFS-9075:
-------------------------------------
HDFS-1432 and HDFS-5442 are prior proposals for multiple data center support.
Both appear to be inactive or abandoned at this point.
> Multiple datacenter replication inside one HDFS cluster
> -------------------------------------------------------
>
> Key: HDFS-9075
> URL: https://issues.apache.org/jira/browse/HDFS-9075
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Reporter: He Tianyi
> Assignee: He Tianyi
>
> Deploying multiple datacenters is a common approach to scaling and
> disaster tolerance.
> In this case, we want data to be shared across datacenters transparently
> to the user.
> For example, say we have a raw user action log stored daily, and different
> computations take the log as input. As scale grows, we may
> want to schedule various kinds of computations in more than one datacenter.
> As far as I know, the current solution is to deploy one independent cluster
> per datacenter and use {{distcp}} to sync data files between
> them.
> But in this case, the user needs to know exactly where data is stored, and
> mistakes may be made during manual operations. After all, this is
> basically a job for a computer.
> Based on these facts, a multiple-datacenter replication
> solution could address this scenario.
> I am working on a prototype that works with 2 datacenters. The goal is to
> provide data replication between datacenters transparently and to minimize
> inter-DC bandwidth usage. The basic idea is to replicate blocks to both DCs and
> determine the number of replicas from historical access statistics
> for that part of the namespace.
> I will post a design document soon.