[
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13929915#comment-13929915
]
Hadoop QA commented on HADOOP-10400:
------------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12633839/HADOOP-10400-1.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//console
This message is automatically generated.
> Incorporate new S3A FileSystem implementation
> ---------------------------------------------
>
> Key: HADOOP-10400
> URL: https://issues.apache.org/jira/browse/HADOOP-10400
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Reporter: Jordan Mendelson
> Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch
>
>
> The s3native filesystem has a number of limitations (some of which were
> recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses
> the aws-sdk instead of the jets3t library. There are a number of improvements
> over s3native including:
> - Parallel copy (rename) support (dramatically speeds up commits on large
> files)
> - AWS S3 explorer-compatible empty directory markers ("xyz/" instead of
> "xyz_$folder$"), which reduces littering
> - Ignores _$folder$ files created by s3native and other S3 browsing
> utilities
> - Supports multiple output buffer dirs to even out IO when uploading files
> - Supports IAM role-based authentication
> - Allows setting a default canned ACL for uploads (public, private, etc.)
> - Better error recovery handling
> - Should handle input seeks without having to download the whole file (used
> heavily for input splits)
> This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to
> various pom files to get it to build against trunk. I've been using 0.0.1 in
> production with CDH 4 for several months and CDH 5 for a few days. The
> version here is 0.0.2, which reworks some configuration keys to bring the
> key naming style more in line with the rest of Hadoop 2.x.
> *Tunable parameters:*
> fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
> fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
> fs.s3a.connection.maximum - Controls how many parallel connections
> HttpClient spawns (default: 15)
> fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3
> (default: true)
> fs.s3a.attempts.maximum - How many times we should retry commands on
> transient errors (default: 10)
> fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
> fs.s3a.paging.maximum - How many keys to request from S3 at a time when
> doing directory listings (default: 5000)
> fs.s3a.multipart.size - How big (in bytes) to split an upload or copy
> operation into (default: 104857600)
> fs.s3a.multipart.threshold - Until a file is this large (in bytes), use a
> non-parallel upload (default: 2147483647)
> fs.s3a.acl.default - Set a canned ACL on newly created/copied objects
> (private | public-read | public-read-write | authenticated-read |
> log-delivery-write | bucket-owner-read | bucket-owner-full-control)
> fs.s3a.multipart.purge - True if you want to purge existing multipart
> uploads that may not have been completed/aborted correctly (default: false)
> fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads
> to purge (default: 86400)
> fs.s3a.buffer.dir - Comma-separated list of directories used to buffer file
> writes (default: uses fs.s3.buffer.dir)
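> A minimal sketch of wiring a few of these keys into a Hadoop Configuration and
> opening an s3a path (the bucket name, credential values, and tuning values are
> placeholders, not recommendations):
> {code:java}
> import java.net.URI;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class S3AConfigSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Depending on how the patch is packaged, the scheme may need to be mapped explicitly.
>     conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
>     // Credentials; omit both keys to use IAM role-based authentication.
>     conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID");   // placeholder
>     conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");      // placeholder
>     // A couple of the tunables described above.
>     conf.setInt("fs.s3a.connection.maximum", 30);
>     conf.setLong("fs.s3a.multipart.size", 64L * 1024 * 1024);
>     conf.set("fs.s3a.buffer.dir", "/tmp/s3a-a,/tmp/s3a-b");
>
>     // "mybucket" is a placeholder bucket name.
>     FileSystem fs = FileSystem.get(URI.create("s3a://mybucket/"), conf);
>     for (FileStatus status : fs.listStatus(new Path("/"))) {
>       System.out.println(status.getPath());
>     }
>   }
> }
> {code}
> The same keys can of course be set in core-site.xml instead of in code.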
> *Caveats*:
> Hadoop uses a standard output committer which uploads files as
> filename.COPYING before renaming them. This can cause unnecessary performance
> issues with S3 because it does not have a rename operation and S3 already
> verifies uploads against an MD5 checksum that the driver sets on the upload
> request. While this FileSystem should be significantly faster than the built-in
> s3native driver because of parallel copy support, you may want to consider
> setting a null output committer on your jobs to further improve performance.
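> The patch itself does not ship such a committer, but as an illustration, a
> no-op committer for the old mapred API could look roughly like this and be
> wired in with JobConf.setOutputCommitter(...):
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.mapred.JobContext;
> import org.apache.hadoop.mapred.OutputCommitter;
> import org.apache.hadoop.mapred.TaskAttemptContext;
>
> /** Hypothetical committer that skips the upload-then-rename (.COPYING) step. */
> public class NoOpOutputCommitter extends OutputCommitter {
>   @Override public void setupJob(JobContext jobContext) throws IOException { }
>   @Override public void setupTask(TaskAttemptContext taskContext) throws IOException { }
>   @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
>     return false; // nothing to commit; tasks are expected to write final output directly
>   }
>   @Override public void commitTask(TaskAttemptContext taskContext) throws IOException { }
>   @Override public void abortTask(TaskAttemptContext taskContext) throws IOException { }
> }
> {code}
> Note that failed task attempts are then no longer cleaned up for you, so this
> only makes sense for jobs whose output writes are idempotent.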
> Because S3 requires the file length and MD5 to be known before a file is
> uploaded, all output is first buffered to a temporary local file, similar to
> the s3native driver.
> Due to the lack of a native rename() in S3, renaming extremely large files or
> directories may take a while. Unfortunately, there is no way to notify Hadoop
> that progress is still being made during rename operations, so your job may
> time out unless you increase the task timeout.
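> For illustration, the MapReduce task timeout could be raised like this (the
> value is a placeholder; the property is specified in milliseconds):
> {code:java}
> import org.apache.hadoop.mapreduce.Job;
>
> public class RaiseTaskTimeout {
>   public static void main(String[] args) throws Exception {
>     Job job = Job.getInstance();
>     // mapreduce.task.timeout defaults to 600000 (10 minutes) in Hadoop 2.x.
>     job.getConfiguration().setLong("mapreduce.task.timeout", 30L * 60 * 1000);
>     // ... configure input/output and submit the job as usual ...
>   }
> }
> {code}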
> This driver will fully ignore _$folder$ files. This was necessary so that it
> could interoperate with repositories that have had the s3native driver used
> on them, but it means that it won't recognize empty directories created by
> s3native.
> Statistics for the filesystem may be calculated differently than for the
> s3native filesystem. When uploading a file, we do not count writing the temporary file
> on the local filesystem towards the local filesystem's written bytes count.
> When renaming files, we do not count the S3->S3 copy as read or write
> operations. Unlike the s3native driver, we only count bytes written when we
> start the upload (as opposed to the write calls to the temporary local file).
> The driver also counts read & write ops, but these are reported mostly to keep
> jobs from timing out on large S3 operations.
> The AWS SDK unfortunately passes the multipart threshold as an int, which means
> fs.s3a.multipart.threshold cannot be greater than 2^31-1 (2147483647).
> This is currently implemented as a FileSystem and not an AbstractFileSystem.
--
This message was sent by Atlassian JIRA
(v6.2#6252)