[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:
----------------------------
    Attachment: HDFS-9868.07.patch

Reviving this jira, attaching patch 7, which is after a new review of patch 6, 
with mostly cosmetic fixes.

The only limitation on the current approach is it has to be run from target 
cluster. Ideally we should support both source and target (i.e. distcp from 
either one, as long as you have the remote conf). But as commented above 
couldn't think of a way to generalized the 'remote' concept. We could add a 
{{-targetClusterConf}} later, and reuse most of the conf code, but let's focus 
on the current patch now.

[~yzhangal], could you please review? Thanks much.

> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
>                 Key: HDFS-9868
>                 URL: https://issues.apache.org/jira/browse/HDFS-9868
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>    Affects Versions: 2.7.1
>            Reporter: NING DING
>            Assignee: NING DING
>         Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. It could take a long time when 
> coping huge data by distp. If the source cluster changes active namenode, the 
> distp will run failed. This patch supports the DistCp can read source cluster 
> files in HA access mode. A source cluster configuration file needs to be 
> specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration
>   file:
> {code:xml}
>     <configuration>
>       <property>
>               <name>fs.defaultFS</name>
>               <value>hdfs://mycluster</value>
>         </property>
>         <property>
>               <name>dfs.nameservices</name>
>               <value>mycluster</value>
>         </property>
>         <property>
>               <name>dfs.ha.namenodes.mycluster</name>
>               <value>nn1,nn2</value>
>         </property>
>         <property>
>               <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>               <value>host1:9000</value>
>         </property>
>         <property>
>               <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>               <value>host2:9000</value>
>         </property>
>         <property>
>               <name>dfs.namenode.http-address.mycluster.nn1</name>
>               <value>host1:50070</value>
>         </property>
>         <property>
>               <name>dfs.namenode.http-address.mycluster.nn2</name>
>               <value>host2:50070</value>
>         </property>
>         <property>
>               <name>dfs.client.failover.proxy.provider.mycluster</name>
>               
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>         </property>
>       </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
>     bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to