[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886076#comment-15886076
 ] 

Yongjun Zhang edited comment on HDFS-9868 at 2/27/17 4:35 PM:
--------------------------------------------------------------

Thanks [~xiaochen] for continuing the effort.

Looking further, I think using DistributedCache is better and safer; the trick 
is how to manage the different conf dirs passed to DistributedCache.

I think one possible solution is the following (this needs to be well documented 
if we decide to go with this approach):
1. (user) Make a copy of each conf dir and put the copies at a central location 
(such as where we kick off DistCp) that's accessible by DistCp.
2. Each conf dir is required to have a simple name, such as cluster1conf, 
cluster2conf.
3. Then we can flatten the names as distcp_cluster1conf1 etc. (including a prefix 
"distcp_" to be safer) when putting them into the distributed cache when running 
distcp.
4. The confMap file entries are: 
cluster1 cluster1conf
cluster2 cluster2conf
...
5. Then with the DistributedCache API, we can get these files and pass them to 
Configuration.addResource APIs.
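
To make steps 2 to 4 more concrete, here is a rough sketch of the confMap 
parsing and name flattening in plain Java. The class name ConfMapSketch, the 
helper names, and the exact flattened form (distcp_<confDirName>_<fileName>) are 
assumptions for illustration only; the naming scheme has not been decided in this 
discussion. The Hadoop calls (Job.addCacheFile and Configuration.addResource) are 
only mentioned in comments so that the sketch runs without Hadoop on the 
classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfMapSketch {
    // Prefix suggested above to reduce the chance of name clashes in the cache.
    static final String PREFIX = "distcp_";

    /** Parses confMap lines of the form "<cluster> <confDirName>". */
    static Map<String, String> parseConfMap(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad confMap entry: " + line);
            }
            map.put(parts[0], parts[1]);
        }
        return map;
    }

    /** Assumed flattening scheme: distcp_<confDirName>_<fileName>. */
    static String flatten(String confDirName, String fileName) {
        if (confDirName.contains("/") || fileName.contains("/")) {
            throw new IllegalArgumentException("Names must be simple (no '/')");
        }
        return PREFIX + confDirName + "_" + fileName;
    }

    public static void main(String[] args) {
        Map<String, String> confMap = parseConfMap(Arrays.asList(
                "cluster1 cluster1conf",
                "cluster2 cluster2conf"));
        for (Map.Entry<String, String> e : confMap.entrySet()) {
            // In DistCp, each flattened file would be registered with
            // Job.addCacheFile(URI) at submission time and later loaded with
            // Configuration.addResource(Path); both calls are left out here.
            System.out.println(e.getKey() + " -> "
                    + flatten(e.getValue(), "hdfs-site.xml"));
        }
    }
}
```

Rejecting names that contain "/" mirrors the "simple names" requirement in step 
2, since the distributed cache exposes files under a flat namespace.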

NOTE: The DistributedCache API is deprecated; its functionality has moved to Job.




> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
>                 Key: HDFS-9868
>                 URL: https://issues.apache.org/jira/browse/HDFS-9868
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 2.7.1
>            Reporter: NING DING
>            Assignee: NING DING
>         Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying a huge amount of data with 
> distcp can take a long time. If the source cluster changes its active namenode 
> during the copy, distcp will fail. This patch lets DistCp read source cluster 
> files in HA access mode. A source cluster configuration file needs to be 
> specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
>     <configuration>
>       <property>
>         <name>fs.defaultFS</name>
>         <value>hdfs://mycluster</value>
>       </property>
>       <property>
>         <name>dfs.nameservices</name>
>         <value>mycluster</value>
>       </property>
>       <property>
>         <name>dfs.ha.namenodes.mycluster</name>
>         <value>nn1,nn2</value>
>       </property>
>       <property>
>         <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>         <value>host1:9000</value>
>       </property>
>       <property>
>         <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>         <value>host2:9000</value>
>       </property>
>       <property>
>         <name>dfs.namenode.http-address.mycluster.nn1</name>
>         <value>host1:50070</value>
>       </property>
>       <property>
>         <name>dfs.namenode.http-address.mycluster.nn2</name>
>         <value>host2:50070</value>
>       </property>
>       <property>
>         <name>dfs.client.failover.proxy.provider.mycluster</name>
>         <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>       </property>
>     </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
>     bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}


