[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

Xiao Chen (JIRA) Sun, 26 Feb 2017 07:16:24 -0800

     [ https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Xiao Chen updated HDFS-9868:
----------------------------
    Attachment: HDFS-9868.10.patch

Thanks [~yzhangal] for the review. I tested on a real cluster (without Kerberos, 
for easier testing; I don't expect Kerberos to be a problem here though). 
Patch 10 is an in-progress, working solution that still needs polishing; I 
uploaded it to ease discussion. Comments are addressed except the following:

bq. 4. Instead of using distributed cache, suggest to use the same location as 
where sequence file is stored, to store the map file
Patch 10 uses the staging dir.

But using the staging dir directly didn't work on a real cluster: a 
{{Configuration}}'s resources have to be local. Currently the confMap and the 
conf dirs it refers to are copied to the staging directory in HDFS, and they 
cannot be loaded into a {{Configuration}} from there. Patch 10 therefore 
downloads them to a local dir on the mappers. (Finding the correct temp dir 
etc. still needs polishing, but that's the idea.)
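
To make the download-to-local idea concrete, here is a minimal sketch that uses only the JDK. It is not the code in patch 10. The class name, method name and temp dir prefix are hypothetical. It takes an {{InputStream}} so that it stays self-contained; on a mapper the stream would presumably come from opening the conf file in the staging dir, and the returned local path is the kind of thing {{Configuration.addResource(Path)}} can load.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only; not part of DistCp.
final class LocalConfStager {

    /**
     * Copies one conf file, supplied as a stream, into a fresh local temp
     * directory and returns the path of the local copy.
     */
    static Path stageLocally(InputStream remote, String fileName) {
        try {
            // A private directory per call avoids clashes between tasks
            // that run on the same host. The prefix is an assumption.
            Path dir = Files.createTempDirectory("distcp-src-conf-");
            dir.toFile().deleteOnExit();

            Path local = dir.resolve(fileName);
            Files.copy(remote, local);
            // Registered after the directory, so it is deleted before it
            // (deleteOnExit runs in reverse registration order).
            local.toFile().deleteOnExit();
            return local;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a stream opened on the staging directory.
        InputStream in = new ByteArrayInputStream(
            "<configuration/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(stageLocally(in, "sourceCluster.xml"));
    }
}
```

Using a fresh temp directory per task is one possible answer to the "correct temp dir" question; a task-scoped working directory supplied by the framework would be another.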

An alternative is to use the distributed cache. I looked at its 
[docs|http://hadoop.apache.org/docs/r3.0.0-alpha2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html]
 but it's not clear to me:
- what's the best way to add directories of files to it.
- whether those files will be added to the classpath, and how to disable that.

Personally I feel copying from the staging dir to local may be safer, but I'd 
appreciate any suggestions.

bq. 2. Better check whether t is null below
Not necessary. See 
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/op2.html
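
Assuming the check in question is an {{instanceof}} test (the patch code is not quoted here), the linked tutorial's point is that {{instanceof}} is false for null, so a separate null check adds nothing. A small sketch; the class, the method and the choice of {{IOException}} are hypothetical:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical demo class; only the variable name t comes from the review.
final class InstanceofNullDemo {

    /** Returns true only when t is a non-null IOException (or subclass). */
    static boolean isIoFailure(Throwable t) {
        // instanceof evaluates to false for null and never throws a
        // NullPointerException, so no "t != null &&" guard is needed.
        return t instanceof IOException;
    }

    public static void main(String[] args) {
        System.out.println(isIoFailure(null));                           // false
        System.out.println(isIoFailure(new FileNotFoundException("x"))); // true
        System.out.println(isIoFailure(new IllegalStateException("x"))); // false
    }
}
```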

bq. 6. In CopyMapper, the confMap is set but not used. We should apply it when 
getting the source and target file system.
It is used to override the CopyListingStatus's getFileSystem.
It seems we should make it work with both source and target. This requires more 
changes; I will try to get to it in the next rev.

TODOs:
- #6 above
- test cleanup
- doc update with final version

> Add ability for DistCp to run between 2 clusters
> ------------------------------------------------
>
>                 Key: HDFS-9868
>                 URL: https://issues.apache.org/jira/browse/HDFS-9868
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 2.7.1
>            Reporter: NING DING
>            Assignee: NING DING
>         Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. It could take a long time when 
> copying huge data by distcp. If the source cluster changes its active 
> namenode, the distcp job will fail. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs to 
> be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
>     <configuration>
>       <property>
>         <name>fs.defaultFS</name>
>         <value>hdfs://mycluster</value>
>       </property>
>       <property>
>         <name>dfs.nameservices</name>
>         <value>mycluster</value>
>       </property>
>       <property>
>         <name>dfs.ha.namenodes.mycluster</name>
>         <value>nn1,nn2</value>
>       </property>
>       <property>
>         <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>         <value>host1:9000</value>
>       </property>
>       <property>
>         <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>         <value>host2:9000</value>
>       </property>
>       <property>
>         <name>dfs.namenode.http-address.mycluster.nn1</name>
>         <value>host1:50070</value>
>       </property>
>       <property>
>         <name>dfs.namenode.http-address.mycluster.nn2</name>
>         <value>host2:50070</value>
>       </property>
>       <property>
>         <name>dfs.client.failover.proxy.provider.mycluster</name>
>         <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>       </property>
>     </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
>     bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}
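
For readers unfamiliar with how the HA keys in the sample configuration fit together, the following sketch reads such a file with the JDK's XML parser and lists the namenode RPC addresses of a nameservice. It only illustrates the key naming scheme; it is not how the HDFS client or this patch resolves namenodes, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader, for illustration only.
final class SourceClusterConfReader {

    // Trimmed copy of the sample configuration from the issue description.
    static final String SAMPLE =
        "<configuration>"
        + "<property><name>dfs.nameservices</name><value>mycluster</value></property>"
        + "<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>"
        + "<property><name>dfs.namenode.rpc-address.mycluster.nn1</name>"
        + "<value>host1:9000</value></property>"
        + "<property><name>dfs.namenode.rpc-address.mycluster.nn2</name>"
        + "<value>host2:9000</value></property>"
        + "</configuration>";

    /** Parses name/value pairs out of a Hadoop-style configuration document. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse configuration", e);
        }
    }

    /** Maps each namenode id of the nameservice to its RPC address. */
    static Map<String, String> namenodeRpcAddresses(Map<String, String> props, String nameservice) {
        Map<String, String> result = new LinkedHashMap<>();
        String ids = props.get("dfs.ha.namenodes." + nameservice);
        if (ids == null) {
            return result; // nameservice is not configured for HA
        }
        for (String id : ids.split(",")) {
            String nn = id.trim();
            String addr = props.get("dfs.namenode.rpc-address." + nameservice + "." + nn);
            if (addr != null) {
                result.put(nn, addr);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(namenodeRpcAddresses(readProperties(SAMPLE), "mycluster"));
    }
}
```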



