[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-26 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Attachment: HDFS-9868.10.patch

Thanks [~yzhangal] for the review. Tested on a real cluster (without Kerberos, 
for easier testing; I don't expect Kerberos to be a problem here though). 
Patch 10 is in progress: it is a working solution that still needs polishing, 
uploaded to ease discussion. Comments are addressed except the following:

bq. 4. Instead of using distributed cache, suggest to use the same location as 
where sequence file is stored, to store the map file
Patch 10 uses the staging dir.

But using the staging dir directly didn't work on a real cluster, because a 
{{Configuration}}'s resources have to be local. Currently the confMap and the 
conf dirs it refers to are copied to the staging directory in HDFS, and they 
cannot be loaded into a {{Configuration}} from there. Patch 10 instead 
downloads them to a local dir on the mappers. (Finding the correct temp dir 
etc. still needs polishing, but that is the idea.)
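
A rough, stdlib-only sketch of the "download to a local dir" step, for discussion only. It is not the patch code: in the patch the source lives in HDFS and would be fetched through the Hadoop FileSystem API, whereas here a local directory stands in for the staging dir. The class name, method name and the "distcp-remote-conf" prefix are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; a local directory stands in for the HDFS staging dir.
final class LocalConfStagingSketch {
    /** Copies every regular file under srcDir into a new temp dir and returns that dir. */
    static Path stageToLocalTemp(Path srcDir) {
        try {
            // createTempDirectory honors java.io.tmpdir and picks a unique name,
            // so concurrent tasks on one host do not collide.
            Path localDir = Files.createTempDirectory("distcp-remote-conf");
            List<Path> files;
            try (Stream<Path> walk = Files.walk(srcDir)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path f : files) {
                // Preserve the layout of the conf dirs relative to srcDir.
                Path target = localDir.resolve(srcDir.relativize(f).toString());
                Files.createDirectories(target.getParent());
                Files.copy(f, target);
            }
            return localDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("fake-staging");
        Files.write(src.resolve("hdfs-site.xml"), "<configuration/>".getBytes());
        System.out.println(stageToLocalTemp(src));
    }
}
```

Once the files are local, they can be handed to a {{Configuration}} as ordinary local resources; cleanup of the temp dir is left out of the sketch.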

An alternative to this is to use the distributed cache. Looked at its 
[docs|http://hadoop.apache.org/docs/r3.0.0-alpha2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html]
 but it's not clear to me:
- what's the best way to add directories of files to it.
- whether those files will be added to classpath and how to disable that.

Personally I feel copying from the staging dir to local may be safer, but I'd 
appreciate any suggestions.

bq. 2. Better check whether t is null below
Not necessary. See 
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/op2.html
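
A minimal sketch of why a separate null check can be redundant, assuming the code in question tests {{t}} with {{instanceof}} (the linked tutorial page covers that operator and notes that null is not an instance of anything). The class and method names below are hypothetical and not taken from the patch.

```java
// Hypothetical helper, not taken from the patch.
final class NullInstanceofSketch {
    /** Returns true only when t is a non-null IOException. */
    static boolean isIoFailure(Throwable t) {
        // instanceof evaluates to false for null, so no explicit null check is needed.
        return t instanceof java.io.IOException;
    }

    public static void main(String[] args) {
        System.out.println(isIoFailure(null));
        System.out.println(isIoFailure(new java.io.IOException("boom")));
        System.out.println(isIoFailure(new RuntimeException("boom")));
    }
}
```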

bq. 6. In CopyMapper, the confMap is set but not used. We should apply it when 
getting the source and target file system.
It is used to override the CopyListingStatus's getFileSystem.
It seems we should make it work with both source and target. This requires 
more changes; I will try to get to it in the next rev.

TODOs:
- #6 above
- test cleanup
- doc update with final version

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying a large amount of data with 
> distcp can take a long time. If the source cluster changes its active 
> namenode during the copy, distcp fails. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs 
> to be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
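> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}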
>   DistCp is then invoked as follows:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}
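
As a rough illustration of how the keys in the example file hang together (a nameservice, then its namenode IDs, then one RPC address per ID), here is a small stdlib-only sketch that parses such a file and lists the RPC addresses. It does not use Hadoop's {{Configuration}} class and is not part of the patch; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
final class HaConfSketch {
    /** Parses Hadoop-style property/name/value XML into a map. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a valid configuration file", e);
        }
    }

    /** Lists the RPC address of every namenode configured for the nameservice. */
    static List<String> rpcAddresses(Map<String, String> props, String nameservice) {
        List<String> result = new ArrayList<>();
        String ids = props.getOrDefault("dfs.ha.namenodes." + nameservice, "");
        for (String id : ids.split(",")) {
            String addr = props.get("dfs.namenode.rpc-address." + nameservice + "." + id.trim());
            if (addr != null) {
                result.add(addr);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
            + "<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>"
            + "<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>host1:9000</value></property>"
            + "<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>host2:9000</value></property>"
            + "</configuration>";
        System.out.println(rpcAddresses(parse(xml), "mycluster"));
    }
}
```

With both namenodes listed this way, a client can fail over between them instead of being pinned to a single host:port.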



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-24 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Attachment: HDFS-9868.09.patch

Patch 9 to address most of the comments.

Kept #5 unchanged because I can't think of a good generalized way to add a 
field to {{Path}}, which is Public Stable. Adding a 'remote cluster path' seems 
too specific to distcp... Better alternatives are welcome.

I'm trying to test it on a (non-kerberized) real cluster. Thanks.

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, HDFS-9868.4.patch
>
>



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-23 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Attachment: HDFS-9868.08.patch

Thanks [~yzhangal] for the comment, I think it's a good idea.
Patch 8 attached to reflect it.

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, 
> HDFS-9868.3.patch, HDFS-9868.4.patch
>
>



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-16 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Issue Type: Improvement  (was: New Feature)

Updating the jira type to Improvement, since this just adds a new option to 
the distcp command.

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-16 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Attachment: HDFS-9868.07.patch

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-16 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Attachment: (was: HDFS-9868.07.patch)

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-16 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Attachment: HDFS-9868.07.patch

Reviving this jira, attaching patch 7, which follows a fresh review of patch 6 
and contains mostly cosmetic fixes.

The only limitation of the current approach is that it has to be run from the 
target cluster. Ideally we should support both source and target (i.e. distcp 
from either one, as long as you have the remote conf). But as commented above, 
I couldn't think of a way to generalize the 'remote' concept. We could add a 
{{-targetClusterConf}} later, and reuse most of the conf code, but let's focus 
on the current patch now.

[~yzhangal], could you please review? Thanks much.

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>



[jira] [Updated] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-02-16 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9868:

Summary: Add ability for DistCp to run between 2 clusters  (was: Add 
ability to read remote cluster configuration for DistCp)

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Affects Versions: 2.7.1
> Reporter: NING DING
> Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, HDFS-9868.4.patch
>
>