Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "AmazonS3" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/AmazonS3?action=diff&rev1=23&rev2=24

Comment:
purge down to the minimum, point people at troubleshooting, tell them not to 
mix JARs.

  = S3 Support in Apache Hadoop =
  
+ Apache Hadoop ships with a connector to S3 called "S3A", with the URL prefix "s3a:"; its previous connectors, "s3" and "s3n", are deprecated and/or have been deleted from recent Hadoop versions.
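+ 
+ As a quick illustration of the URL prefix (a minimal sketch; the bucket and path names here are hypothetical):
+ 
+ {{{
+ # list a directory in a bucket through the S3A connector
+ hadoop fs -ls s3a://example-bucket/datasets/
+ 
+ # copy a local file into the bucket
+ hadoop fs -put results.csv s3a://example-bucket/datasets/results.csv
+ }}}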
- [[http://aws.amazon.com/s3|Amazon S3]] (Simple Storage Service) is a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and [[AmazonEC2]] instances in the same geographical location is free. Most importantly, the data is preserved when a transient Hadoop cluster is shut down.
  
- This makes use of S3 common in Hadoop clusters on EC2. It is also sometimes used for backing up remote clusters.
- 
- Hadoop provides multiple filesystem clients for reading and writing to and 
from Amazon S3 or compatible service.
+  1. Consult the [[http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html|Latest Hadoop documentation]] for the specifics of using the S3A connector.
+  1. For Hadoop 2.x releases, see the latest [[https://github.com/apache/hadoop/blob/branch-2/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md|troubleshooting documentation]].
+  1. For Hadoop 3.x releases, see the latest [[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md|troubleshooting documentation]].
  
  
- === Recommended: S3A (URI scheme: s3a://) - Hadoop 2.7+ ===
+ == S3 Support in Amazon EMR ==
  
+ Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and its own closed-source S3 client. Consult [[http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html|Amazon's documentation on this]].
+ Only Amazon can provide support and/or field bug reports related to their S3 
support.
- '''S3A is the recommended S3 Client for Hadoop 2.7 and later'''
- 
- A successor to the S3 Native, s3n:// filesystem, the S3a: system uses 
Amazon's libraries to interact with S3. This allows S3a to support larger files 
(no more 5GB limit), higher performance operations and more. The filesystem is 
intended to be a replacement for/successor to S3 Native: all objects accessible 
from s3n:// URLs should also be accessible from s3a simply by replacing the URL 
schema.
- 
- S3A has been usable in production since Hadoop 2.7, and is undergoing active 
maintenance for enhanced security, scalability and performance.
- 
- History
- 
-  1. Hadoop 2.6: Initial Implementation: 
[[https://issues.apache.org/jira/browse/HADOOP-10400|HADOOP-10400]]
-  2. Hadoop 2.7: Production Ready: 
[[https://issues.apache.org/jira/browse/HADOOP-11571|HADOOP-11571]]
-  3. Hadoop 2.8: Performance, robustness and security 
[[https://issues.apache.org/jira/browse/HADOOP-11694|HADOOP-11694]]
-  4. Hadoop 2.9: Even more features: 
[[https://issues.apache.org/jira/browse/HADOOP-13204|HADOOP-13204]]
- 
- July 2016: For details of ongoing work on S3a, consult 
[[www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production|Hadoop
 & Cloud Storage: Object Store Integration in Production]]
- 
- '''important:''' S3A requires the exact version of the amazon-aws-sdk against 
which Hadoop was built (and is bundled with). If you try to upgrade the library 
by dropping in a later version, things will break.
  
  
- === Unmaintained: S3N FileSystem (URI scheme: s3n://) ===
+ == Important: Classpath setup ==
  
+  1. The S3A connector is implemented in the hadoop-aws JAR. If it is not on 
the classpath: stack trace.
+  1. Do not attempt to mix a "hadoop-aws" version with other hadoop artifacts 
from different versions. They must be from exactly the same release. Otherwise: 
stack trace.
+  1. The S3A connector depends on AWS SDK JARs. If they are not on the classpath: stack trace.
+  1. Do not attempt to use an Amazon AWS SDK JAR different from the one which the Hadoop version was built with. Otherwise: stack trace highly likely.
+  1. The normative list of dependencies of a specific version of the hadoop-aws JAR is stored in Maven, and can be viewed on [[http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws|mvnrepository]]. A sketch of checking which JARs are actually on a deployed classpath follows below.
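+ 
+ As a quick sanity check (a sketch only; the paths and version numbers shown are illustrative and depend on your release), you can confirm which hadoop-aws and AWS SDK JARs are actually on the classpath:
+ 
+ {{{
+ # expand the Hadoop classpath and pick out the S3A-related JARs;
+ # if nothing is printed, the hadoop-aws module is not on the classpath
+ # (the --glob option expands wildcard entries on Hadoop 2.6+)
+ hadoop classpath --glob | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'
+ # expected output is something like (versions are examples only):
+ #   $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.1.1.jar
+ #   $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar
+ }}}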
- '''S3N is the S3 Client for Hadoop 2.6 and earlier. From Hadoop 2.7+, switch 
to s3a'''
- 
- A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The S3N code is stable and widely used, but is not adding any new features (which is why it remains stable).
- 
- S3N requires a compatible version of the jets3t JAR on the classpath.
- 
- Since Hadoop 2.6, all work on S3 integration has been with S3A. S3N is not 
maintained except for security risks —this helps guarantee security. Most bug 
reports against S3N will be closed as WONTFIX and the text "use S3A". Please 
switch to S3A if you can -and do try it before filing bug reports against S3N.
  
  
+ === Important: you need a consistency layer to use S3 as a destination of MapReduce, Spark and Hive work ===
- === (Deprecated) S3 Block FileSystem (URI scheme: s3://) ===
- 
- '''S3 is deprecated and will be removed from Hadoop 3.0'''
- 
- '''important:''' this section covers the s3:// filesystem support from the 
Apache Software Foundation. The one in Amazon EMR is different —see the details 
at the bottom of this page.
- 
- A block-based filesystem backed by S3. Files are stored as blocks, just like 
they are in HDFS. This permits efficient implementation of renames. This 
filesystem requires you to dedicate a bucket for the filesystem - you should 
not use an existing bucket containing files, or write other files to the same 
bucket. The files stored by this filesystem can be larger than 5GB, but they 
are not interoperable with other S3 tools. Nobody is/should be uploading data 
to S3 via this scheme any more; it will eventually be removed from Hadoop 
entirely. Consider it (as of May 2016), deprecated.
- 
- 
- == History ==
-  * The S3 block filesystem was introduced in Hadoop 0.10.0 
([[http://issues.apache.org/jira/browse/HADOOP-574|HADOOP-574]]).
-  * The S3 native filesystem was introduced in Hadoop 0.18.0 
([[http://issues.apache.org/jira/browse/HADOOP-930|HADOOP-930]]) and rename 
support was added in Hadoop 0.19.0 
([[https://issues.apache.org/jira/browse/HADOOP-3361|HADOOP-3361]]).
-  * The S3A filesystem was introduced in Hadoop 2.6.0. Some issues were found 
and fixed for later Hadoop versions 
[[https://issues.apache.org/jira/browse/HADOOP-11571|HADOOP-11571]].
- 
- 
- == Working with S3 from Apache Hadoop ==
- 
- Consult the 
[[http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html|Latest
 Hadoop documentation]] for the specifics on using any of the S3 clients.
- 
- 
- === Important: you cannot use S3 as a replacement for HDFS ===
  
  You cannot use any of the S3 filesystem clients as a drop-in replacement for 
HDFS. Amazon S3 is an "object store" with
-  * eventual consistency: changes made by one application (creation, updates 
and deletions) will not be visible until some undefined time.
+  * Eventual consistency: changes made by one application (creation, updates 
and deletions) will not be visible until some undefined time.
-  * s3n and s3a: non-atomic rename and delete operations. Renaming or deleting 
large directories takes time proportional to the number of entries -and visible 
to other processes during this time, and indeed, until the eventual consistency 
has been resolved.
+  * Non-atomic rename and delete operations. Renaming or deleting large directories takes time proportional to the number of entries, and the partially-completed operation is visible to other processes during this time, and indeed until the eventual consistency has been resolved. This breaks the commit protocol used by all these applications to safely commit the output of multiple tasks within a job.
  
+ Hadoop 3.x ships with 
[[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3guard.md|S3Guard]]
 for consistency, and the 
[[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md|S3A
 Committers]] for committing work.
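+ 
+ For illustration, here is a sketch of setting up S3Guard from the command line; the table and bucket names are hypothetical, and the exact options for your release are described in the S3Guard documentation linked above:
+ 
+ {{{
+ # create a DynamoDB table and bind it as the S3Guard metadata store for a bucket
+ hadoop s3guard init -meta dynamodb://example-metadata-table s3a://example-bucket/
+ 
+ # confirm the bucket's S3Guard status
+ hadoop s3guard bucket-info s3a://example-bucket/
+ }}}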
- S3 is not a filesystem. The Hadoop S3 filesystem bindings make it pretend to 
be a filesystem, but it is not. It can
- act as a source of data, and as a destination -though in the latter case, you 
must remember that the output may not be immediately visible.
  
- === Security ===
+ For Amazon EMR, use the EMRFS "consistent view" feature for a consistent view of the store.
+ 
+ === Important: Security ===
  
  Your Amazon Secret Access Key is that: secret. If it gets known you have to 
go to the [[https://portal.aws.amazon.com/gp/aws/securityCredentials|Security 
Credentials]] page and revoke it. Try and avoid printing it in logs, or 
checking the XML configuration files into revision control.
  
   1. Do not ever check it in to revision control systems.
   1. Although the clients (currently) support embedding the credentials in the 
URI, this is very dangerous: it will appear in logs and error messages. Avoid 
this.
-  1. S3A supports more authentication mechanisms: consult the documentation 
and, ideally, use one.
+  1. S3A will automatically pick up the credentials of the current IAM role when running on an EC2 VM. For long-lived access keys, one way to keep them out of configuration files is sketched below.
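+ 
+ One way to keep the secret key out of XML files checked into revision control is the Hadoop credential provider framework; the following is a sketch only, with hypothetical key values, paths and bucket names:
+ 
+ {{{
+ # store the S3A access and secret keys in a JCEKS credential store on HDFS
+ hadoop credential create fs.s3a.access.key -value AKIAEXAMPLEKEY \
+     -provider jceks://hdfs@namenode/user/alice/s3.jceks
+ hadoop credential create fs.s3a.secret.key -value exampleSecretKey \
+     -provider jceks://hdfs@namenode/user/alice/s3.jceks
+ 
+ # point jobs at the credential store instead of embedding the secrets in the configuration
+ hadoop distcp \
+     -D hadoop.security.credential.provider.path=jceks://hdfs@namenode/user/alice/s3.jceks \
+     hdfs://namenode/data s3a://example-bucket/backups/
+ }}}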
  
- === Running bulk copies in and out of S3 ===
- 
- Support for the S3 block filesystem was added to the `${HADOOP_HOME}/bin/hadoop distcp` tool in Hadoop 0.11.0 (See [[https://issues.apache.org/jira/browse/HADOOP-862|HADOOP-862]]).  The `distcp` tool sets up a MapReduce job to run the copy.  Using `distcp`, a cluster of many members can copy lots of data quickly.  The number of map tasks is calculated by counting the number of files in the source: i.e. each map task is responsible for copying one file.  Source and target may refer to disparate filesystem types.  For example, source might refer to the local filesystem or `hdfs` with `S3` as the target.
- 
- The `distcp` tool is useful for quickly prepping S3 for MapReduce jobs that 
use S3 for input or for backing up the content of `hdfs`.
- 
- Here is an example copying a nutch segment named `0070206153839-1998` at 
`/user/nutch` in `hdfs` to an S3 bucket named 'nutch' (Let the S3 
AWS_ACCESS_KEY_ID be `123` and the S3 AWS_ACCESS_KEY_SECRET be `456`):
- 
- 
- {{{
- % ${HADOOP_HOME}/bin/hadoop distcp 
hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3a://nutch/
- }}}
- 
- Flip the arguments if you want to run the copy in the opposite direction.
- 
- Other schemes supported by `distcp` include `file:` (for local), and `http:`.
- 
- == S3 Support in Amazon EMR ==
- 
- Amazon's EMR Service is based upon Apache Hadoop, but contains modifications 
and their own, proprietary, S3 client. Consult 
[[http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html|Amazon's
 documentation on this]]. Due to the fact that their code is proprietary, only 
Amazon can provide support and/or field bug reports related to their S3 support.
- 

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
