[ https://issues.apache.org/jira/browse/HADOOP-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15683631#comment-15683631 ]
ASF GitHub Bot commented on HADOOP-13655:
-----------------------------------------

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/131#discussion_r88898982

    --- Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm ---
    @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources
     The SSL configuration file must be in the class-path of the DistCp program.
    +$H3 DistCp and Object Stores
    +
    +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
    +
    +Prerequisites
    +
    +1. The JAR containing the object store implementation is on the classpath,
    +along with all of its dependencies.
    +1. Unless the JAR automatically registers its bundled filesystem clients,
    +the configuration may need to be modified to state the class which
    +implements the filesystem schema. All of the ASF's own object store clients
    +are self-registering.
    +1. The relevant object store access credentials must be available in the cluster
    +configuration, or be otherwise available in all cluster hosts.
    +
    +DistCp can be used to upload data
    +
    +```bash
    +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
    +```
    +
    +To download data
    +
    +```bash
    +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
    +```
    +
    +To copy data between object stores
    +
    +```bash
    +hadoop distcp s3a://bucket/generated/results \
    +  wasb://upda...@example.blob.core.windows.net
    +```
    +
    +And to copy data within an object store
    +
    +```bash
    +hadoop distcp wasb://upda...@example.blob.core.windows.net/current \
    +  wasb://upda...@example.blob.core.windows.net/old
    +```
    +
    +And to use `-update` to only copy changed files.
    +
    +```bash
    +hadoop distcp -update -numListstatusThreads 20 \
    +  swift://history.cluster1/2016 \
    +  hdfs://nn1:8020/history/2016
    +```
    +
    +Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
    +on a large directory tree (the limit is 40 threads).
    +
    +When `DistCp -update` is used with object stores,
    +generally only the modification time and length of the individual files are compared,
    +not any checksums. The fact that most object stores do have valid timestamps
    +for directories is irrelevant; only the file timestamps are compared.
    +However, it is important to have the clock of the client computers close
    +to that of the infrastructure, so that timestamps are consistent between
    +the client/HDFS cluster and that of the object store. Otherwise, changed files may be
    +missed/copied too often.
    +
    +**Notes**
    +
    +* The `-atomic` option causes a rename of the temporary data, so significantly
    +increases the time to commit work at the end of the operation. Furthermore,
    +as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories
    +the `-atomic` operation doesn't actually deliver what is promised. *Avoid*.
    +
    +* The `-append` option is not supported.
    +
    +* The `-diff` option is not supported
    --- End diff --

    ok

> document object store use with fs shell and distcp
> --------------------------------------------------
>
>                 Key: HADOOP-13655
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13655
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: documentation, fs, fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> There's no specific docs for working with object stores from the {{hadoop
> fs}} shell or in distcp; people either suffer from this (performance,
> billing), or learn through trial and error what to do.
> Add a section in both fs shell and distcp docs covering use with object
> stores.
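
For anyone scripting the `-update` case from the quoted diff, here is a minimal dry-run sketch. It only assembles the command line and prints it, clamping `-numListstatusThreads` to the 40-thread limit the diff mentions. The helper name `build_update_cmd` and the variable `MAX_LIST_THREADS` are hypothetical and not part of Hadoop; the paths are the ones used in the diff's own example.

```bash
#!/usr/bin/env bash
# Sketch only: compose (but do not run) a `hadoop distcp -update` command,
# capping the listing threads at the limit quoted in the diff above.

MAX_LIST_THREADS=40   # assumption: taken from "the limit is 40 threads"

# build_update_cmd is a hypothetical helper, not a Hadoop command.
# usage: build_update_cmd THREADS SOURCE DEST
build_update_cmd() {
  local threads="$1" src="$2" dest="$3"
  # Clamp the requested thread count to the documented maximum.
  if [ "$threads" -gt "$MAX_LIST_THREADS" ]; then
    threads="$MAX_LIST_THREADS"
  fi
  printf 'hadoop distcp -update -numListstatusThreads %s %s %s\n' \
    "$threads" "$src" "$dest"
}

# Print the command for review before running it by hand.
build_update_cmd 64 swift://history.cluster1/2016 hdfs://nn1:8020/history/2016
```

Printing rather than executing keeps the sketch safe to try on a machine without a Hadoop client; once the printed line looks right it can be pasted into a shell on a cluster node.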