[ https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049018#comment-14049018 ]

Hadoop QA commented on HADOOP-10400:
------------------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12653407/HADOOP-10400-6.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-client hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs-httpfs.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4194//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4194//console

This message is automatically generated.

> Incorporate new S3A FileSystem implementation
> ---------------------------------------------
>
>                 Key: HADOOP-10400
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10400
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/s3
>    Affects Versions: 2.4.0
>            Reporter: Jordan Mendelson
>            Assignee: Jordan Mendelson
>         Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
> HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch, 
> HADOOP-10400-6.patch
>
>
> The s3native filesystem has a number of limitations (some of which were 
> recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
> the aws-sdk instead of the jets3t library. There are a number of improvements 
> over s3native including:
> - Parallel copy (rename) support (dramatically speeds up commits on large 
> files)
> - AWS S3 explorer compatible empty directory files ("xyz/" instead of 
> "xyz_$folder$"), reducing littering
> - Ignores _$folder$ files created by s3native and other S3 browsing 
> utilities
> - Supports multiple output buffer dirs to even out IO when uploading files
> - Supports IAM role-based authentication
> - Allows setting a default canned ACL for uploads (public, private, etc.)
> - Better error recovery handling
> - Should handle input seeks without having to download the whole file (used 
> heavily for splits)
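> As a quick usage sketch (not part of the patch): once s3a is on the 
> classpath, a client can go through the standard FileSystem API with an 
> s3a:// URI. The bucket name, path, and credential values below are 
> hypothetical.
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class S3AListing {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Credentials may be omitted when using IAM role-based authentication.
>     conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY"); // hypothetical value
>     conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY"); // hypothetical value
>     // "my-bucket" is a hypothetical bucket name.
>     FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
>     for (FileStatus st : fs.listStatus(new Path("s3a://my-bucket/data"))) {
>       System.out.println(st.getPath() + " " + st.getLen());
>     }
>   }
> }
> {code}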
> This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
> various pom files to get it to build against trunk. I've been using 0.0.1 in 
> production with CDH 4 for several months and CDH 5 for a few days. The 
> version here is 0.0.2, which renames some configuration keys to bring the 
> key name style more in line with the rest of Hadoop 2.x.
> *Tunable parameters:*
>     fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
>     fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
>     fs.s3a.connection.maximum - Controls how many parallel connections 
> HttpClient spawns (default: 15)
>     fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
> (default: true)
>     fs.s3a.attempts.maximum - How many times we should retry commands on 
> transient errors (default: 10)
>     fs.s3a.connection.timeout - Socket connect timeout in milliseconds 
> (default: 5000)
>     fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
> doing directory listings (default: 5000)
>     fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
> operation into (default: 104857600)
>     fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
> a non-parallel upload (default: 2147483647)
>     fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
> (private | public-read | public-read-write | authenticated-read | 
> log-delivery-write | bucket-owner-read | bucket-owner-full-control)
>     fs.s3a.multipart.purge - True if you want to purge existing multipart 
> uploads that may not have been completed/aborted correctly (default: false)
>     fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
> to purge (default: 86400)
>     fs.s3a.buffer.dir - Comma-separated list of directories used to buffer 
> file writes (default: ${hadoop.tmp.dir}/s3a)
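> As a concrete sketch of setting a few of the keys above programmatically 
> (every value here is illustrative, not a recommendation):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class S3ATuning {
>   // Returns a Configuration with a few s3a keys tuned; all values are
>   // examples only.
>   public static Configuration tunedConf() {
>     Configuration conf = new Configuration();
>     conf.setInt("fs.s3a.connection.maximum", 30);
>     conf.setInt("fs.s3a.attempts.maximum", 10);
>     conf.setLong("fs.s3a.multipart.size", 64L * 1024 * 1024); // 64 MB parts
>     conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
>     conf.set("fs.s3a.buffer.dir", "/mnt/disk1/s3a,/mnt/disk2/s3a");
>     return conf;
>   }
> }
> {code}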
> *Caveats*:
> Hadoop uses a standard output committer which uploads files as 
> filename.COPYING before renaming them. This can cause unnecessary performance 
> issues with S3 because it does not have a rename operation and S3 already 
> verifies uploads against an MD5 that the driver sets on the upload request. 
> While this FileSystem should be significantly faster than the built-in 
> s3native driver because of parallel copy support, you may want to consider 
> setting a null output committer on your jobs to further improve performance.
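> A minimal sketch of such a null committer for the old mapred API follows; it 
> assumes writers emit output directly, so there is nothing to promote at 
> commit time. With the old API it can be wired in through the 
> mapred.output.committer.class job property; the newer mapreduce API obtains 
> its committer from the OutputFormat, so a custom OutputFormat would be 
> needed there.
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.mapred.JobContext;
> import org.apache.hadoop.mapred.OutputCommitter;
> import org.apache.hadoop.mapred.TaskAttemptContext;
>
> // No-op committer: skips the filename.COPYING rename cycle entirely.
> public class NullOutputCommitter extends OutputCommitter {
>   public void setupJob(JobContext context) throws IOException {}
>   public void setupTask(TaskAttemptContext context) throws IOException {}
>   public boolean needsTaskCommit(TaskAttemptContext context)
>       throws IOException {
>     return false; // nothing to commit, so commitTask is never called
>   }
>   public void commitTask(TaskAttemptContext context) throws IOException {}
>   public void abortTask(TaskAttemptContext context) throws IOException {}
> }
> {code}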
> Because S3 requires the file length and MD5 to be known before a file is 
> uploaded, all output is buffered out to a temporary file first, similar to 
> the s3native driver.
> Due to the lack of a native rename() for S3, renaming extremely large files 
> or directories may take a while. Unfortunately, there is no way to notify 
> Hadoop that progress is still being made during rename operations, so your 
> job may time out unless you increase the task timeout.
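> For example, the per-task timeout can be raised via the 
> mapreduce.task.timeout property (the 30 minutes below is illustrative; 0 
> disables the timeout entirely):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class RenameTimeoutTuning {
>   // Raise the per-task timeout so long S3 copy-based renames are not
>   // killed mid-flight. The value is in milliseconds.
>   public static void raiseTimeout(Configuration conf) {
>     conf.setLong("mapreduce.task.timeout", 30L * 60 * 1000);
>   }
> }
> {code}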
> This driver will fully ignore _$folder$ files. This was necessary so that it 
> could interoperate with repositories that have had the s3native driver used 
> on them, but it means the driver won't recognize empty directories that 
> s3native created.
> Statistics for the filesystem may be calculated differently from the s3native 
> filesystem. When uploading a file, we do not count writing the temporary file 
> on the local filesystem towards the local filesystem's written bytes count. 
> When renaming files, we do not count the S3->S3 copy as read or write 
> operations. Unlike the s3native driver, we only count bytes written when we 
> start the upload (as opposed to the write calls to the temporary local file). 
> The driver also counts read and write operations, but mostly to keep from 
> timing out on large S3 operations.
> The AWS SDK unfortunately passes the multipart threshold as an int, which 
> means fs.s3a.multipart.threshold cannot be greater than 2^31-1 (2147483647).
> This is currently implemented as a FileSystem and not an AbstractFileSystem.



