[ https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888341#comment-15888341 ]

Yongjun Zhang edited comment on HDFS-9868 at 2/28/17 4:19 PM:
--------------------------------------------------------------

Here is my proposed approach to handle confMap with addCacheArchive:

distcp -confMapDir <xyz>

<xyz> is a local dir on the host where distcp is to be run. It contains:
{code}
<xyz>/confMapping
<xyz>/<cluster1>/*xml
<xyz>/<cluster2>/*xml
<xyz>/<cluster3>/*xml
......
{code}

content of <xyz>/confMapping:
{code}
hdfs://x.y.z:8020 <cluster1>
hdfs://<cluster2-ha-service> <cluster2>
webhdfs://a.b.c:50070 <cluster3>
......
{code}
and <cluster1> is a dir that holds the needed conf files for cluster1 
(hdfs://x.y.z:8020), <cluster2> is a similar dir for cluster2 
(hdfs://<cluster2-ha-service>), and so on.
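
As an illustration only (not something taken from a patch), the confMapping file could be parsed on the client side roughly as below; the class and method names are hypothetical:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds a map of "cluster URI -> conf subdirectory"
// from the confMapping file described above.
public class ConfMappingParser {
  public static Map<String, String> parse(String confMappingFile) throws IOException {
    Map<String, String> uriToConfDir = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get(confMappingFile), StandardCharsets.UTF_8)) {
      line = line.trim();
      if (line.isEmpty()) {
        continue; // skip blank lines
      }
      // Expected format: "<cluster-uri> <conf-subdir>",
      // e.g. "hdfs://x.y.z:8020 cluster1"
      String[] parts = line.split("\\s+", 2);
      if (parts.length == 2) {
        uriToConfDir.put(parts[0], parts[1]);
      }
    }
    return uriToConfDir;
  }
}
{code}
A mapper or the client could then look up the conf subdirectory for a given cluster URI before loading that cluster's *.xml files.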

DistCp creates a tar file of dir <xyz> as <xyz>.tar (optionally this could be 
done manually, depending on our final implementation, but in that case 
consistency between the original content and the tar file would be a 
concern), then calls
job.addCacheArchive(new URI("<xyz>.tar"));
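
A rough sketch of this driver-side step, assuming the tar file already exists locally; the staging path and class name below are illustrative, not part of the proposal:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper: stages <xyz>.tar on the default filesystem and
// registers it as a cached archive on the DistCp job.
public class ConfArchiveSetup {
  public static void addConfArchive(Job job, String localTarPath) throws IOException {
    Configuration conf = job.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // The archive has to be readable by the cluster before the distributed
    // cache can localize it; /tmp is just an example staging location.
    Path staged = new Path("/tmp", new Path(localTarPath).getName());
    fs.copyFromLocalFile(new Path(localTarPath), staged);
    // Archives are unpacked on the task side under a directory named after
    // the archive file, which is why tasks see ./<xyz>.tar/...
    job.addCacheArchive(staged.toUri());
  }
}
{code}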

CopyMapper/CopyCommitter would access the files in the distributed cache as
{code}
./<xyz>.tar/confMapping
./<xyz>.tar/<cluster1>/*xml
./<xyz>.tar/<cluster2>/*xml
./<xyz>.tar/<cluster3>/*xml
......
{code}
and call Configuration.addResource(Path path) to add the conf files.
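
For illustration, the task-side loading might look roughly like this; the directory names follow the layout above and the class is hypothetical:
{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: builds a Configuration for one cluster from the
// unpacked archive in the task working directory, e.g.
// loadClusterConf("./<xyz>.tar", "cluster1").
public class ClusterConfLoader {
  public static Configuration loadClusterConf(String archiveDir, String clusterDir)
      throws IOException {
    Configuration conf = new Configuration();
    File dir = new File(archiveDir, clusterDir);
    File[] xmlFiles = dir.listFiles((d, name) -> name.endsWith(".xml"));
    if (xmlFiles == null) {
      throw new IOException("Conf dir not found: " + dir);
    }
    for (File xml : xmlFiles) {
      // Configuration.addResource(Path) merges the file's properties into conf.
      conf.addResource(new Path(xml.getAbsolutePath()));
    }
    return conf;
  }
}
{code}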




> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
>                 Key: HDFS-9868
>                 URL: https://issues.apache.org/jira/browse/HDFS-9868
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 2.7.1
>            Reporter: NING DING
>            Assignee: NING DING
>         Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. It can take a long time when 
> copying huge data with distcp. If the source cluster changes its active 
> namenode, the distcp job will fail. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs 
> to be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration
>   file:
> {code:xml}
>     <configuration>
>       <property>
>               <name>fs.defaultFS</name>
>               <value>hdfs://mycluster</value>
>         </property>
>         <property>
>               <name>dfs.nameservices</name>
>               <value>mycluster</value>
>         </property>
>         <property>
>               <name>dfs.ha.namenodes.mycluster</name>
>               <value>nn1,nn2</value>
>         </property>
>         <property>
>               <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>               <value>host1:9000</value>
>         </property>
>         <property>
>               <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>               <value>host2:9000</value>
>         </property>
>         <property>
>               <name>dfs.namenode.http-address.mycluster.nn1</name>
>               <value>host1:50070</value>
>         </property>
>         <property>
>               <name>dfs.namenode.http-address.mycluster.nn2</name>
>               <value>host2:50070</value>
>         </property>
>         <property>
>               <name>dfs.client.failover.proxy.provider.mycluster</name>
>               
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>         </property>
>       </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
>     bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}


