He Tianyi created HDFS-9075:
-------------------------------
Summary: Multiple datacenter replication inside one HDFS cluster
Key: HDFS-9075
URL: https://issues.apache.org/jira/browse/HDFS-9075
Project: Hadoop HDFS
Issue Type: New Feature
Components: datanode, namenode
Reporter: He Tianyi
Assignee: He Tianyi
It is a common scenario to deploy multiple datacenters for scaling and
disaster tolerance.
In this case we certainly want data to be shared transparently (to the user)
across datacenters.
For example, say we have a raw user action log stored daily, and different
computations take the log as input. As scale grows, we may want to schedule
various kinds of computations in more than one datacenter.
As far as I know, the current solution is to deploy one cluster per
datacenter and use {{distcp}} to sync data between them.
But in this case, the user needs to know exactly where the data is stored, and
mistakes may be made during manual operations. After all, this is basically a
job for a computer.
Based on these facts, a multiple-datacenter replication solution inside a
single cluster could address this scenario.
I am working on a prototype that works with 2 datacenters. The goal is to
provide data replication between datacenters transparently and to minimize
inter-DC bandwidth usage. The basic idea is to replicate blocks to both DCs
and to determine the number of replicas from historical statistics of access
behavior for that part of the namespace.
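To make the basic idea concrete, here is a minimal sketch of one way replica
counts could be derived from access history. It is not the prototype's actual
policy, which has not been described yet, and it uses no HDFS API. The
function name, the fixed replica budget, the one-replica-per-DC floor, and the
proportional split are all assumptions made for illustration.

```python
def replicas_per_dc(reads_by_dc, total_replicas=4, min_per_dc=1):
    """Split a replica budget across datacenters by historical read counts.

    Hypothetical policy: every DC keeps at least `min_per_dc` replicas, and
    the remaining replicas are shared out in proportion to past reads, so
    that most reads can be served without crossing datacenters.
    """
    dcs = sorted(reads_by_dc)
    floor_total = min_per_dc * len(dcs)
    if total_replicas < floor_total:
        raise ValueError("replica budget is too small for the per-DC floor")

    plan = {dc: min_per_dc for dc in dcs}
    spare = total_replicas - floor_total
    total_reads = sum(reads_by_dc.values())

    if total_reads == 0:
        # No access history yet: spread the spare replicas evenly.
        for i in range(spare):
            plan[dcs[i % len(dcs)]] += 1
        return plan

    # Largest-remainder split of the spare replicas by share of reads.
    shares = {dc: spare * reads_by_dc[dc] / total_reads for dc in dcs}
    for dc in dcs:
        plan[dc] += int(shares[dc])
    leftover = spare - sum(int(shares[dc]) for dc in dcs)
    by_fraction = sorted(dcs, key=lambda dc: shares[dc] - int(shares[dc]),
                         reverse=True)
    for dc in by_fraction[:leftover]:
        plan[dc] += 1
    return plan
```

With a budget of 4 replicas and a path read 900 times from one DC and 100
times from the other, this sketch places 3 replicas in the busier DC and 1 in
the other. Keeping a floor of one replica per DC is what preserves the
disaster-tolerance property. A real policy would also need to weigh write
cost and how quickly access patterns change.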
I will post a design document soon.