[
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Chen updated HDFS-9868:
----------------------------
Attachment: HDFS-9868.10.patch
Thanks [~yzhangal] for the review. Tested on a real cluster (without Kerberos,
for easier testing; I don't expect Kerberos to be a problem here though).
Patch 10 is in progress: it is a working solution that still needs polishing,
and I uploaded it to ease discussion. Comments are addressed except the following:
bq. 4. Instead of using distributed cache, suggest to use the same location as
where sequence file is stored, to store the map file
Patch 10 uses the staging dir.
But using the staging dir directly didn't work on a real cluster: a
{{Configuration}}'s resources have to be local. Currently the confMap and the
conf dirs it refers to are copied to the staging directory in HDFS, and they
cannot be loaded into a {{Configuration}} from there. Patch 10 therefore
downloads them to a local dir on the mappers. (Finding the correct temp dir
etc. still needs polishing, but that is the idea.)
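To make the "download to a local dir, then load" idea concrete, here is a minimal sketch using only the plain JDK. It is not the code in patch 10: the class {{LocalConfCopy}} and its two methods are hypothetical names, {{java.nio}} stands in for the Hadoop {{FileSystem}} copy out of the staging dir, and a small DOM parse stands in for loading the local file into a {{Configuration}}.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not part of the patch. */
final class LocalConfCopy {

  /** Copies the whole tree under stagedDir into a fresh local temp dir. */
  static Path copyToLocalTemp(Path stagedDir) throws IOException {
    Path localDir = Files.createTempDirectory("distcp-source-conf");
    List<Path> entries;
    try (Stream<Path> walk = Files.walk(stagedDir)) {
      // Files.walk visits a directory before its children.
      entries = walk.collect(Collectors.toList());
    }
    for (Path entry : entries) {
      Path target = localDir.resolve(stagedDir.relativize(entry).toString());
      if (Files.isDirectory(entry)) {
        Files.createDirectories(target);
      } else {
        Files.copy(entry, target);
      }
    }
    return localDir;
  }

  /** Reads name/value pairs from a local Hadoop-style configuration XML file. */
  static Map<String, String> readProperties(Path localXml) throws Exception {
    Document doc;
    try (InputStream in = Files.newInputStream(localXml)) {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    }
    Map<String, String> props = new LinkedHashMap<>();
    NodeList nodes = doc.getElementsByTagName("property");
    for (int i = 0; i < nodes.getLength(); i++) {
      Element property = (Element) nodes.item(i);
      String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
      String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
      props.put(name, value);
    }
    return props;
  }
}
```

In the real mapper the copy would go through the Hadoop file system API and the result would feed a {{Configuration}}; the sketch only shows the shape of the two steps and leaves temp dir selection and cleanup open, as noted above.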
An alternative to this is to use the distributed cache. Looked at its
[docs|http://hadoop.apache.org/docs/r3.0.0-alpha2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html]
but it's not clear to me:
- what's the best way to add directories of files to it.
- whether those files will be added to the classpath, and how to disable that.
Personally I feel copying from the staging dir to local may be safer, but I'd
appreciate any suggestions.
bq. 2. Better check whether t is null below
Not necessary. See
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/op2.html
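For readers who don't follow the link: that tutorial page notes that {{null}} is not an instance of anything. Assuming the code under review tests {{t}} with {{instanceof}} (the patch is not quoted here, so this is a guess, and the class and method names below are made up for illustration), a separate null check adds nothing:

```java
import java.io.IOException;

/** Hypothetical demo class; not taken from the patch. */
final class NullInstanceofDemo {

  /** True only for a non-null IOException; a null argument yields false, with no NPE. */
  static boolean isIoFailure(Throwable t) {
    return t instanceof IOException;
  }
}
```

If the check in the patch is something other than {{instanceof}}, this sketch does not apply.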
bq. 6. In CopyMapper, the confMap is set but not used. We should apply it when
getting the source and target file system.
It is used to override the CopyListingStatus's getFileSystem.
It seems we should make it work with both source and target. This requires more
changes; I will try to get to it in the next rev.
TODOs:
- #6 above
- test cleanup
- doc update with final version
> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch,
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch,
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch,
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. It can take a long time to copy
> a large amount of data with distcp. If the source cluster changes its active
> namenode during the copy, the distcp job fails. This patch lets DistCp read
> source cluster files in HA access mode. A source cluster configuration file
> needs to be specified (via the -sourceClusterConf option).
> The following is an example of the contents of a source cluster
> configuration file:
> {code:xml}
> <configuration>
> <property>
> <name>fs.defaultFS</name>
> <value>hdfs://mycluster</value>
> </property>
> <property>
> <name>dfs.nameservices</name>
> <value>mycluster</value>
> </property>
> <property>
> <name>dfs.ha.namenodes.mycluster</name>
> <value>nn1,nn2</value>
> </property>
> <property>
> <name>dfs.namenode.rpc-address.mycluster.nn1</name>
> <value>host1:9000</value>
> </property>
> <property>
> <name>dfs.namenode.rpc-address.mycluster.nn2</name>
> <value>host2:9000</value>
> </property>
> <property>
> <name>dfs.namenode.http-address.mycluster.nn1</name>
> <value>host1:50070</value>
> </property>
> <property>
> <name>dfs.namenode.http-address.mycluster.nn2</name>
> <value>host2:50070</value>
> </property>
> <property>
> <name>dfs.client.failover.proxy.provider.mycluster</name>
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> </configuration>
> {code}
> The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}