Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "AmazonS3" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/AmazonS3?action=diff&rev1=20&rev2=21

Comment:
lots more on S3a and why to use it, warnings of state of s3n and deprecation of s3

= S3 Support in Apache Hadoop =

[[http://aws.amazon.com/s3|Amazon S3]] (Simple Storage Service) is a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and [[AmazonEC2]] instances in the same geographical location is free. Most importantly, the data is preserved when a transient Hadoop cluster is shut down.

This makes S3 a common store for Hadoop clusters on EC2. It is also sometimes used for backing up remote clusters.

Hadoop provides multiple filesystem clients for reading from and writing to Amazon S3 or a compatible service.

=== Recommended: S3A (URI scheme: s3a://) - Hadoop 2.7+ ===

'''S3A is the recommended S3 client for Hadoop 2.7 and later.'''

A successor to the S3 Native (s3n://) filesystem, S3A uses Amazon's own libraries to interact with S3. This allows S3A to support larger files (no more 5GB limit), higher-performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a:// simply by replacing the URL scheme.

S3A has been usable in production since Hadoop 2.7, and is undergoing active maintenance for enhanced security, scalability and performance.

History:

 1. Hadoop 2.6: Initial implementation: [[https://issues.apache.org/jira/browse/HADOOP-10400|HADOOP-10400]]
 2. Hadoop 2.7: Production ready: [[https://issues.apache.org/jira/browse/HADOOP-11571|HADOOP-11571]]
 3. Hadoop 2.8: Performance, robustness and security: [[https://issues.apache.org/jira/browse/HADOOP-11694|HADOOP-11694]]
 4. Hadoop 2.9: Even more features: [[https://issues.apache.org/jira/browse/HADOOP-13204|HADOOP-13204]]

July 2016: For details of ongoing work on S3A, consult [[https://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production|Hadoop & Cloud Storage: Object Store Integration in Production]].

'''important:''' S3A requires the exact version of the amazon-aws-sdk against which Hadoop was built (and is bundled with). If you try to upgrade the library by dropping in a later version, things will break.
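As a quick, illustrative check that the S3A client and its dependencies are wired up correctly, a basic filesystem operation can be run against a bucket; `mybucket` below is a placeholder for a bucket you have access to:

{{{
# list the contents of a bucket through the s3a:// client
% ${HADOOP_HOME}/bin/hadoop fs -ls s3a://mybucket/

# round-trip a small file to confirm both writes and reads work
% ${HADOOP_HOME}/bin/hadoop fs -put README.txt s3a://mybucket/test/README.txt
% ${HADOOP_HOME}/bin/hadoop fs -cat s3a://mybucket/test/README.txt
}}}

If this fails with missing-class errors, check that the hadoop-aws module and the matching version of the amazon-aws-sdk JAR are on the classpath, as described above.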
=== Unmaintained: S3N FileSystem (URI scheme: s3n://) ===

'''S3N is the S3 client for Hadoop 2.6 and earlier. From Hadoop 2.7+, switch to s3a://.'''

A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The S3N code is stable and widely used, but is not adding any new features (which is why it remains stable).

S3N requires a compatible version of the jets3t JAR on the classpath.

Since Hadoop 2.6, all work on S3 integration has been done in S3A. S3N is not maintained except for fixes to security issues; leaving the code untouched is what keeps it stable. Most bug reports against S3N will be closed as WONTFIX with the text "use S3A". Please switch to S3A if you can, and do try it before filing bug reports against S3N.

=== (Deprecated) S3 Block FileSystem (URI scheme: s3://) ===

'''S3 is deprecated and will be removed in a future Hadoop release.'''

'''important:''' this section covers the s3:// filesystem support from the Apache Software Foundation. The one in Amazon EMR is different; see the details at the bottom of this page.

A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket to it: you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools. Nobody is, or should be, uploading data to S3 via this scheme any more; it will eventually be removed from Hadoop entirely. Consider it (as of May 2016) deprecated.

== History ==

 * The S3 block filesystem was introduced in Hadoop 0.10.0 ([[http://issues.apache.org/jira/browse/HADOOP-574|HADOOP-574]]).

[...]

=== Security ===

Your Amazon Secret Access Key is just that: secret. If it gets known, you have to go to the [[https://portal.aws.amazon.com/gp/aws/securityCredentials|Security Credentials]] page and revoke it. Try to avoid printing it in logs, or checking XML configuration files into revision control.

 1. Do not ever check it in to revision control systems.
 1. Although the clients (currently) support embedding the credentials in the URI, this is very dangerous: they will appear in logs and error messages. Avoid this.
 1. S3A supports more authentication mechanisms: consult the documentation and, ideally, use one of them.
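One illustrative way to keep credentials out of URIs and logs is to set them in the configuration instead; `fs.s3a.access.key` and `fs.s3a.secret.key` are the property names used by S3A, and the values below are placeholders, never real keys:

{{{
<!-- core-site.xml: illustrative only; the values are placeholders.
     Keep any file holding real keys out of revision control. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
}}}

Even this is only a baseline; consult the S3A documentation for the alternative authentication mechanisms, which avoid keeping the secret in a plain-text file at all.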
=== Running bulk copies in and out of S3 ===

[...]

{{{
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3a://nutch/
}}}

Flip the arguments if you want to run the copy in the opposite direction, as sketched below.
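A sketch of the reverse copy, reusing the placeholder hostname and paths from the example above:

{{{
# copy the data back out of the bucket into HDFS
% ${HADOOP_HOME}/bin/hadoop distcp s3a://nutch/0070206153839-1998 hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998
}}}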
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]