[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888341#comment-15888341
 ] 

Yongjun Zhang commented on HDFS-9868:
-------------------------------------

Here is my proposed approach for handling confMap with addCacheArchive:

distcp -confMapDir <xyz>

<xyz> is a local directory on the host where distcp is run. It contains:
{code}
<xyz>/confMapping
<xyz>/<cluster1>/*xml
<xyz>/<cluster2>/*xml
<xyz>/<cluster3>/*xml
......
{code}

content of <xyz>/confMapping:
{code}
hdfs://x.y.z:8020 <cluster1>
hdfs://<cluster2-ha-service> <cluster2>
webhdfs://a.b.c:50070 <cluster3>
......
{code}
Here <cluster1> is a dir that holds the conf files needed for cluster1 
(hdfs://x.y.z:8020), <cluster2> is the corresponding dir for cluster2 
(hdfs://<cluster2-ha-service>), and so on.
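
To make the confMapping format concrete, here is a minimal sketch of how the file could be parsed into a URI-to-directory map. The class name ConfMappingParser and the handling of blank lines and '#' comments are my assumptions for illustration; they are not part of any patch on this issue.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: parses confMapping lines of the form "<fs-uri> <cluster-dir>". */
final class ConfMappingParser {
    private ConfMappingParser() {}

    /** Returns a map from file system URI prefix to cluster dir name, in file order. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comment lines are ignored.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Each entry is two whitespace-separated fields.
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException(
                    "Expected '<fs-uri> <cluster-dir>' but got: " + raw);
            }
            mapping.put(parts[0], parts[1]);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Example entries; the host names and dir names are placeholders.
        Map<String, String> mapping = parse(Arrays.asList(
            "hdfs://x.y.z:8020 cluster1",
            "webhdfs://a.b.c:50070 cluster3"));
        System.out.println(mapping);
    }
}
```

In a real implementation the lines would be read from <xyz>/confMapping on the client, or from the unpacked archive in the task.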

Distcp creates a tar file of dir <xyz> as <xyz>.tar, then calls
job.addCacheArchive(new URI("<xyz>.tar"));

CopyMapper/CopyCommitter would access the files in the distributed cache as
{code}
./<xyz>.tar/confMapping
./<xyz>.tar/<cluster1>/*xml
./<xyz>.tar/<cluster2>/*xml
./<xyz>.tar/<cluster3>/*xml
......
{code}
and call Configuration.addResource(Path file) to add the conf files.
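
The lookup on the task side could work roughly as in the sketch below: derive the scheme://authority key from the path being copied, find the matching cluster dir, and collect its *.xml files. ClusterConfLocator, fsKey and xmlFilesFor are hypothetical names, and the sketch uses only java.nio so it runs without Hadoop on the classpath. The caller would then pass each file to Configuration.addResource, converting it to a Hadoop Path first.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds the conf files for the cluster that owns a given URI. */
final class ClusterConfLocator {
    private ClusterConfLocator() {}

    /** Builds the "scheme://authority" key used on the left side of confMapping. */
    static String fsKey(URI uri) {
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    /** Lists the *.xml files under archiveRoot/<cluster-dir>, sorted; empty if unmapped. */
    static List<Path> xmlFilesFor(Path archiveRoot, Map<String, String> mapping, URI target)
            throws IOException {
        String clusterDir = mapping.get(fsKey(target));
        if (clusterDir == null) {
            return Collections.emptyList();
        }
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(archiveRoot.resolve(clusterDir), "*.xml")) {
            for (Path p : stream) {
                result.add(p);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the unpacked archive directory in the task's working dir.
        Path root = Files.createTempDirectory("confMapDir");
        Path cluster1 = Files.createDirectory(root.resolve("cluster1"));
        Files.createFile(cluster1.resolve("hdfs-site.xml"));

        Map<String, String> mapping = new HashMap<>();
        mapping.put("hdfs://x.y.z:8020", "cluster1");

        for (Path conf : xmlFilesFor(root, mapping, URI.create("hdfs://x.y.z:8020/foo/bar"))) {
            // In CopyMapper this is where the file would be added to the Configuration.
            System.out.println(conf);
        }
    }
}
```

Keying on scheme plus authority means an HA nameservice entry and a plain host:port entry are both matched the same way.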



> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
>                 Key: HDFS-9868
>                 URL: https://issues.apache.org/jira/browse/HDFS-9868
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 2.7.1
>            Reporter: NING DING
>            Assignee: NING DING
>         Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying a large amount of data with 
> distcp can take a long time. If the source cluster changes its active namenode 
> during the copy, the distcp job fails. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs to 
> be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
>     <configuration>
>       <property>
>         <name>fs.defaultFS</name>
>         <value>hdfs://mycluster</value>
>       </property>
>       <property>
>         <name>dfs.nameservices</name>
>         <value>mycluster</value>
>       </property>
>       <property>
>         <name>dfs.ha.namenodes.mycluster</name>
>         <value>nn1,nn2</value>
>       </property>
>       <property>
>         <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>         <value>host1:9000</value>
>       </property>
>       <property>
>         <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>         <value>host2:9000</value>
>       </property>
>       <property>
>         <name>dfs.namenode.http-address.mycluster.nn1</name>
>         <value>host1:50070</value>
>       </property>
>       <property>
>         <name>dfs.namenode.http-address.mycluster.nn2</name>
>         <value>host2:50070</value>
>       </property>
>       <property>
>         <name>dfs.client.failover.proxy.provider.mycluster</name>
>         <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>       </property>
>     </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
>     bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}


