[
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888341#comment-15888341
]
Yongjun Zhang commented on HDFS-9868:
-------------------------------------
Here is my proposed approach to handling confMap with addCacheArchive:
distcp -confMapDir <xyz>
<xyz> is a local directory on the host where distcp is run. It contains:
{code}
<xyz>/confMapping
<xyz>/<cluster1>/*xml
<xyz>/<cluster2>/*xml
<xyz>/<cluster3>/*xml
......
{code}
content of <xyz>/confMapping:
{code}
hdfs://x.y.z:8020 <cluster1>
hdfs://<cluster2-ha-service> <cluster2>
webhdfs://a.b.c:50070 <cluster3>
......
{code}
and <cluster1> is a dir that holds the conf files needed for cluster1
(hdfs://x.y.z:8020), <cluster2> is the corresponding dir for cluster2
(hdfs://<cluster2-ha-service>), and so on.
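As a rough illustration of the mapping format described above, here is a minimal sketch of a parser for confMapping. It assumes one whitespace-separated "<filesystem URI> <cluster dir>" pair per line; the class name ConfMappingParser, the method parse, and the handling of blank lines and '#' comments are hypothetical and not part of any attached patch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from the HDFS-9868 patches.
class ConfMappingParser {

    /** Parses lines of the form "<filesystem URI> <cluster dir>" into an ordered map. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comment lines are skipped.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed confMapping line: " + raw);
            }
            mapping.put(parts[0], parts[1]); // URI prefix -> cluster conf dir
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = parse(Arrays.asList(
                "hdfs://x.y.z:8020 cluster1",
                "hdfs://cluster2-ha-service cluster2",
                "webhdfs://a.b.c:50070 cluster3"));
        // Print each URI and the conf directory it maps to.
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Rejecting malformed lines early is a design choice in this sketch; a real implementation might prefer to log and skip them.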
Distcp creates a tar file of dir <xyz> as <xyz>.tar, then calls
job.addCacheArchive(new URI("<xyz>.tar"));
CopyMapper/CopyCommitter would access the files in the distributed cache as
{code}
./<xyz>.tar/confMapping
./<xyz>.tar/<cluster1>/*xml
./<xyz>.tar/<cluster2>/*xml
./<xyz>.tar/<cluster3>/*xml
......
{code}
and call Configuration.addResource(Path path) to add the conf files.
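The lookup step on the task side could be sketched roughly as follows. This is only an illustration using plain java.nio: it assumes the archive has been unpacked under a local root directory laid out as shown above, and the class name ClusterConfLocator and method listConfFiles are hypothetical. In a real mapper, each returned file would presumably be wrapped in a Hadoop Path and handed to something like Configuration.addResource(Path); that call is not made here so the sketch stays free of Hadoop dependencies.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not taken from the HDFS-9868 patches.
class ClusterConfLocator {

    /** Returns the *.xml files directly under <archiveRoot>/<clusterDir>, sorted by path. */
    static List<Path> listConfFiles(Path archiveRoot, String clusterDir) throws IOException {
        List<Path> confFiles = new ArrayList<>();
        // The glob is matched against each entry's file name.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(archiveRoot.resolve(clusterDir), "*.xml")) {
            for (Path p : stream) {
                confFiles.add(p);
            }
        }
        Collections.sort(confFiles); // deterministic order for loading
        return confFiles;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ClusterConfLocator <archiveRoot> <clusterDir>");
            return;
        }
        for (Path p : listConfFiles(Paths.get(args[0]), args[1])) {
            System.out.println(p);
        }
    }
}
```

Sorting the files gives a stable load order, which matters if later resources are expected to override earlier ones.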
> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch,
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch,
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch,
> HDFS-9868.4.patch
>
>
> Normally an HDFS cluster is HA enabled. Copying a large amount of data with
> distcp can take a long time. If the source cluster changes its active namenode
> during the copy, the distcp job fails. This patch lets DistCp read source
> cluster files in HA access mode. A source cluster configuration file needs to be
> specified (via the -sourceClusterConf option).
> The following is an example of the contents of a source cluster
> configuration file:
> {code:xml}
> <configuration>
> <property>
> <name>fs.defaultFS</name>
> <value>hdfs://mycluster</value>
> </property>
> <property>
> <name>dfs.nameservices</name>
> <value>mycluster</value>
> </property>
> <property>
> <name>dfs.ha.namenodes.mycluster</name>
> <value>nn1,nn2</value>
> </property>
> <property>
> <name>dfs.namenode.rpc-address.mycluster.nn1</name>
> <value>host1:9000</value>
> </property>
> <property>
> <name>dfs.namenode.rpc-address.mycluster.nn2</name>
> <value>host2:9000</value>
> </property>
> <property>
> <name>dfs.namenode.http-address.mycluster.nn1</name>
> <value>host1:50070</value>
> </property>
> <property>
> <name>dfs.namenode.http-address.mycluster.nn2</name>
> <value>host2:50070</value>
> </property>
> <property>
> <name>dfs.client.failover.proxy.provider.mycluster</name>
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> </configuration>
> {code}
> The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}