[jira] [Updated] (HDFS-9075) Multiple datacenter replication inside one HDFS cluster

He Tianyi (JIRA) Mon, 14 Sep 2015 01:21:08 -0700

     [ https://issues.apache.org/jira/browse/HDFS-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


He Tianyi updated HDFS-9075:
----------------------------
    Description: 
Deploying multiple datacenters is a common way to scale out and to tolerate 
disasters. 
In this case we certainly want data to be shared across datacenters, 
transparently to the user.

For example, say we have a raw user action log that is stored daily, and 
different computations take the log as input. As scale grows, we may want to 
schedule various kinds of computations in more than one datacenter.

As far as I know, the current solution is to deploy multiple independent 
clusters, one per datacenter, and use {{distcp}} to sync data files between them.
But in this case, users need to know exactly where data is stored, and mistakes 
may be made during manual operations. After all, this is basically a job for a 
computer.

Given this, replicating data across multiple datacenters inside one HDFS 
cluster could address the scenario.

I am working on a prototype that works with 2 datacenters. The goal is to 
provide data replication between datacenters transparently and to minimize 
inter-DC bandwidth usage. The basic idea is to replicate blocks to both DCs and 
to determine the number of replicas from historical statistics of access 
behavior for that part of the namespace.
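
A minimal sketch of this idea, not the prototype itself: it counts reads per 
namespace prefix and datacenter, then maps each datacenter's share of reads to 
a replica count. The names AccessStats and plan_replication, the per-DC bounds 
of 1 and 3 replicas, and the proportional rule are all assumptions made for 
illustration.

```python
from collections import defaultdict


class AccessStats:
    """Hypothetical in-memory read counter keyed by namespace prefix and DC."""

    def __init__(self):
        # prefix -> datacenter -> number of recorded reads
        self._reads = defaultdict(lambda: defaultdict(int))

    def record_read(self, prefix, dc):
        self._reads[prefix][dc] += 1

    def counts(self, prefix, datacenters):
        # Datacenters with no recorded reads are reported with a count of 0.
        return {dc: self._reads[prefix][dc] for dc in datacenters}


def plan_replication(access_counts, min_per_dc=1, max_per_dc=3):
    """Map each datacenter to a replica count based on its share of reads.

    Assumption: every DC keeps at least min_per_dc replicas (blocks live in
    both DCs), and extra replicas grow with the DC's share of reads.
    """
    total = sum(access_counts.values())
    plan = {}
    for dc, count in access_counts.items():
        share = count / total if total else 0.0
        extra = round(share * (max_per_dc - min_per_dc))
        plan[dc] = min_per_dc + extra
    return plan


stats = AccessStats()
for _ in range(9):
    stats.record_read("/logs/raw", "dc1")
stats.record_read("/logs/raw", "dc2")

plan = plan_replication(stats.counts("/logs/raw", ["dc1", "dc2"]))
print(plan)
```

With 9 reads from dc1 and 1 read from dc2, this plan keeps 3 replicas in dc1 
and 1 in dc2, so the rarely reading datacenter still holds a copy without 
pulling extra replicas across the inter-DC link.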

I will post a design document soon.

  was:
Deploying multiple datacenters is a common way to scale out and to tolerate 
disasters. 
In this case we certainly want data to be shared across datacenters, 
transparently to the user.

For example, say we have a raw user action log that is stored daily, and 
different computations take the log as input. As scale grows, we may want to 
schedule various kinds of computations in more than one datacenter.

As far as I know, the current solution is to deploy multiple clusters, one per 
datacenter, and use {{distcp}} to sync data between them.
But in this case, users need to know exactly where data is stored, and mistakes 
may be made during manual operations. After all, this is basically a job for a 
computer.

Given this, replicating data across multiple datacenters inside one HDFS 
cluster could address the scenario.

I am working on a prototype that works with 2 datacenters. The goal is to 
provide data replication between datacenters transparently and to minimize 
inter-DC bandwidth usage. The basic idea is to replicate blocks to both DCs and 
to determine the number of replicas from historical statistics of access 
behavior for that part of the namespace.

I will post a design document soon.


> Multiple datacenter replication inside one HDFS cluster
> -------------------------------------------------------
>
>                 Key: HDFS-9075
>                 URL: https://issues.apache.org/jira/browse/HDFS-9075
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: He Tianyi
>            Assignee: He Tianyi
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
