[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2015-03-21 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-10400:

Assignee: Jordan Mendelson

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: New Feature
  Components: fs, fs/s3
Affects Versions: 2.4.0
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Fix For: 2.6.0

 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch, 
 HADOOP-10400-6.patch, HADOOP-10400-7.patch, HADOOP-10400-8-branch-2.patch, 
 HADOOP-10400-8.patch, HADOOP-10400-branch-2.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing 
 utilities
 - Supports multiple output buffer dirs to even out I/O when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 heavily for splits)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files so that it builds against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which renames some keys to hopefully bring the key 
 name style more in line with the rest of Hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout in milliseconds 
 (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
 doing directory listings (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use a 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma-separated list of directories used to buffer 
 file writes (default: ${hadoop.tmp.dir}/s3a)
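 As a sketch (not part of the patch) of how these keys wire together through 
 the standard Configuration API; the bucket name, credentials, and tuning 
 values below are placeholders and examples, not recommendations:
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials; omit both keys to fall back to IAM role authentication.
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");        // placeholder
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");        // placeholder
    // Tuning keys set explicitly to the defaults listed above.
    conf.setInt("fs.s3a.connection.maximum", 15);
    conf.setBoolean("fs.s3a.connection.ssl.enabled", true);
    conf.setLong("fs.s3a.multipart.size", 104857600L);       // 100 MB parts
    conf.set("fs.s3a.acl.default", "private");               // canned ACL
    conf.set("fs.s3a.buffer.dir", "/data1/s3a,/data2/s3a");  // example dirs
    // "mybucket" is a placeholder bucket name.
    FileSystem fs = FileSystem.get(URI.create("s3a://mybucket/"), conf);
    System.out.println(fs.exists(new Path("/")));
  }
}
{code}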
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation, and S3 already 
 verifies uploads against an MD5 checksum that the driver sets on the upload 
 request. While this FileSystem should be significantly faster than the 
 built-in s3native driver because of parallel copy support, you may want to 
 consider setting a null output committer on your jobs to further improve 
 performance.
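 For reference, a "null" committer can be as simple as a no-op OutputCommitter 
 subclass; the sketch below (class name hypothetical, new mapreduce API 
 assumed) is illustrative and not part of this patch:
{code}
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// No-op committer: nothing is set up, committed, or aborted, so output
// written directly to S3 skips the filename.COPYING rename cycle.
public class NullOutputCommitter extends OutputCommitter {
  @Override public void setupJob(JobContext jobContext) throws IOException {}
  @Override public void setupTask(TaskAttemptContext taskContext) throws IOException {}
  @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) { return false; }
  @Override public void commitTask(TaskAttemptContext taskContext) throws IOException {}
  @Override public void abortTask(TaskAttemptContext taskContext) throws IOException {}
}
{code}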
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of a native rename() in S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify Hadoop 
 that progress is still being made during rename operations, so your job may 
 time out unless you increase the task timeout.
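 Continuing the Configuration sketch above, raising the timeout per job might 
 look like the following (property name as in Hadoop 2.x; the value is an 
 arbitrary example):
{code}
// 30 minutes, in milliseconds; pick a value that covers your largest rename.
conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);
{code}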
 This driver fully ignores _$folder$ files. This was necessary so that it 
 could interoperate with repositories that the s3native driver has been used 
 on, but it means the driver won't recognize empty directories that s3native 
 created.
 Statistics for the filesystem may be calculated differently than the 
 s3native filesystem.

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2015-02-22 Thread Chris Bannister (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bannister updated HADOOP-10400:
-
Assignee: (was: Chris Bannister)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-15 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HADOOP-10400:

Issue Type: New Feature  (was: Improvement)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-15 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HADOOP-10400:

  Resolution: Fixed
   Fix Version/s: 2.6.0
Target Version/s: 2.6.0  (was: 3.0.0)
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2.

Thanks a lot, Jordan, for the initial implementation and thanks to Dave for 
taking the patch over the finish line. Thanks also to Steve and others for all 
of the reviews they've provided.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-13 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Attachment: HADOOP-10400-8-branch-2.patch

Here's a branch-2 backport for the latest patch HADOOP-10400-8.patch.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-12 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Status: Open  (was: Patch Available)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-12 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Attachment: HADOOP-10400-8.patch

Thanks Steve for your comments.

I've attached a trunk patch to address the concerns that weren't longer-term 
items; I believe all of them are addressed.

I did the same testing as mentioned in previous comments. In addition, all of 
the newly added s3a FS contract tests pass.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-12 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Status: Patch Available  (was: Open)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-11 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Status: Open  (was: Patch Available)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-11 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Attachment: HADOOP-10400-7.patch

HADOOP-10400-7.patch does the following:

* Rebases HADOOP-10400-6.patch onto the current tip of trunk, which includes 
the HADOOP-11074 changes that move the S3 connector bits over to hadoop-aws.
* Incorporates HADOOP-10675, HADOOP-10676, and HADOOP-10677, which were fixes 
on top of previous candidate HADOOP-10400 patches.
* Corrects the Jackson 2 dependencies used by the hadoop-aws and hadoop-azure 
modules.

Regarding testing, I ran mvn clean install -Pnative -DskipTests from the top 
level, and mvn test in both the hadoop-aws and hadoop-azure directories.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-11 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Target Version/s: 3.0.0
  Status: Patch Available  (was: Open)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-09-11 Thread David S. Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David S. Wang updated HADOOP-10400:
---
Attachment: HADOOP-10400-branch-2.patch

This is the branch-2 backport for HADOOP-10400. It requires HADOOP-11074 to 
be applied first. It applied almost entirely cleanly, except for a reference 
in a POM file to hadoop-azure, which is not present in branch-2.

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs, fs/s3
Affects Versions: 2.4.0
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch, 
 HADOOP-10400-6.patch, HADOOP-10400-7.patch, HADOOP-10400-branch-2.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directories files xyz/ instead of 
 xyz_$folder$ (reduces littering)
 - Ignores s3native created _$folder$ files created by s3native and other S3 
 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2 which changes around some keys to hopefully bring the 
 key name style more inline with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 when doing 
 directory listings at a time (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split a upload or copy 
 operation up into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma separated list of directories that will be used 
 to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a )
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify 
 hadoop that progress is still being made for rename operations, so your job 
 may time out unless you increase the task timeout.
 This driver will fully ignore _$folder$ files. This was necessary so that it 
 could interoperate with repositories that have had the s3native 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-07-01 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HADOOP-10400:
-

Attachment: HADOOP-10400-6.patch

Attached v6, which is the same as v5 but fixes the fs.open().close() case.
The wrappedObject is initialized only inside read(), so calling close() 
before a read will throw an NPE. testInputStreamClosedTwice() should reproduce 
the problem, since it does fs.open().close().
{code}
@@ -175,7 +175,9 @@ public class S3AInputStream extends FSInputStream {
 }
 super.close();
 closed = true;
-wrappedObject.close();
+if (wrappedObject != null) {
+  wrappedObject.close();
+}
   }
{code}
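
For context, a minimal sketch of the kind of test that trips this, assuming 
the fs field and the path()/createFile() helpers that 
FileSystemContractBaseTest provides (the real check is 
testInputStreamClosedTwice(); names here are illustrative):
{code}
// Open a stream and close it without ever reading: before this fix,
// wrappedObject was still null at close() time and an NPE was thrown.
public void testOpenThenCloseWithoutRead() throws IOException {
  Path file = path("/tests3a/file1");
  createFile(file);
  FSDataInputStream in = fs.open(file);
  in.close();  // must be a no-op even though read() was never called
  in.close();  // and a second close() should be harmless too
}
{code}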

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs, fs/s3
Affects Versions: 2.4.0
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch, 
 HADOOP-10400-6.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which changes around some keys to hopefully bring the 
 key name style more in line with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
 doing directory listings (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma separated list of directories that will be used 
 to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a )
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify 
 hadoop that progress is still 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-05-30 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-10400:


  Component/s: fs/s3
Affects Version/s: 2.4.0

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs, fs/s3
Affects Versions: 2.4.0
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which changes around some keys to hopefully bring the 
 key name style more in line with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
 doing directory listings (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma separated list of directories that will be used 
 to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a )
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify 
 hadoop that progress is still being made for rename operations, so your job 
 may time out unless you increase the task timeout.
 This driver will fully ignore _$folder$ files. This was necessary so that it 
 could interoperate with repositories that have had the s3native driver used 
 on them, but it means that it won't recognize empty directories created by 
 s3native.
 Statistics for the filesystem may be calculated differently than the s3native 
 filesystem. When uploading a file, we do not count writing the temporary file 
 on the 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-19 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Attachment: HADOOP-10400-5.patch

This new version (-5) adjusts the test to use FileSystemContractBaseTest. 

However, because FileSystemContractBaseTest uses the JUnit 3 runner, we can't 
use assume*() functions to skip the tests if we don't have a valid URL to test 
against, so the test has been renamed from TestS3AFileSystem to 
S3AFileSystemContractBaseTest and is consequently no longer run by default. A 
better solution would be to update FileSystemContractBaseTest for JUnit 4, but 
there are a lot of dependencies and that seems outside the scope of this patch.

I'm not entirely sure why the linter is complaining about these patches. The 
last one complained about code completely outside of mine. Odd.
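
(If the base test ever moves to JUnit 4, the skip could be expressed directly; 
a rough sketch, with org.junit.Assume/org.junit.Before imports and the conf 
field and property name assumed for illustration:)
{code}
// Hypothetical JUnit 4 version of the skip: the suite exempts itself when no
// S3A test URI is configured, instead of being renamed out of the default run.
@Before
public void requireTestBucket() {
  Assume.assumeTrue("No S3A test URI configured",
      conf.get("test.fs.s3a.name") != null);  // property name is illustrative
}
{code}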

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which changes around some keys to hopefully bring the 
 key name style more in line with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
 doing directory listings (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma separated list of directories that will be used 
 to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a )
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of native rename() for S3, renaming extremely large 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-11 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Attachment: HADOOP-10400-4.patch

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs
Reporter: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which changes around some keys to hopefully bring the 
 key name style more in line with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
 doing directory listings (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma separated list of directories that will be used 
 to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a )
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify 
 hadoop that progress is still being made for rename operations, so your job 
 may time out unless you increase the task timeout.
 This driver will fully ignore _$folder$ files. This was necessary so that it 
 could interoperate with repositories that have had the s3native driver used 
 on them, but it means that it won't recognize empty directories created by 
 s3native.
 Statistics for the filesystem may be calculated differently than the s3native 
 filesystem. When uploading a file, we do not count writing the temporary file 
 on the local filesystem towards the local filesystem's written bytes count. 
 When renaming files, we do not count the 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

It should be largely compatible with s3native except that it won't recognize 
s3native's empty directory marker files (*_$folder$), since it uses folder/ 
like Amazon's S3 explorer to denote empty directories.

Other caveats:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies uploads 
against an md5 that the driver sets on the upload request. While this 
FileSystem should be significantly faster than the built-in s3native driver 
because of parallel copy support, you may want to consider setting a null 
output committer on your jobs to further improve performance.

Because S3 requires the file length be known before a file is uploaded, all 
output is buffered out to a temporary file first similar to the s3native driver.

Due to the lack of native rename() for S3, renaming extremely large files or 
directories may take a while. Unfortunately, there is no way to notify hadoop 
that progress is still being made for rename operations, so your job may time 
out unless you increase the task timeout.

Statistics for the filesystem may be calculated differently than the s3native 
filesystem. When uploading a file, we do not count writing the temporary file 
on the local filesystem towards the local filesystem's written bytes count. 
When renaming files, we do not count the S3-S3 copy as read or write 
operations. Unlike the s3native driver, we only count bytes written when we 
start the upload (as opposed to the write calls to the temporary local file). 
The driver also counts read & write ops, but they are done mostly to keep from 
timing out on large s3 operations.

This is currently implemented as a FileSystem and not an AbstractFileSystem.

  was:
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

It should be largely compatible with s3native except that it won't recognize 
s3native's empty directory marker files (*_$folder$), since it uses folder/ 
like Amazon's S3 explorer to denote empty directories.


 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
   

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

*Caveats*:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies uploads 
against an md5 that the driver sets on the upload request. While this 
FileSystem should be significantly faster than the built-in s3native driver 
because of parallel copy support, you may want to consider setting a null 
output committer on your jobs to further improve performance.

Because S3 requires the file length be known before a file is uploaded, all 
output is buffered out to a temporary file first similar to the s3native driver.

Due to the lack of native rename() for S3, renaming extremely large files or 
directories may take a while. Unfortunately, there is no way to notify hadoop 
that progress is still being made for rename operations, so your job may time 
out unless you increase the task timeout.

This driver will fully ignore _$folder$ files. This was necessary so that it 
could interoperate with repositories that have had the s3native driver used on 
them, but it means that it won't recognize empty directories created by 
s3native.

Statistics for the filesystem may be calculated differently than the s3native 
filesystem. When uploading a file, we do not count writing the temporary file 
on the local filesystem towards the local filesystem's written bytes count. 
When renaming files, we do not count the S3-S3 copy as read or write 
operations. Unlike the s3native driver, we only count bytes written when we 
start the upload (as opposed to the write calls to the temporary local file). 
The driver also counts read & write ops, but they are done mostly to keep from 
timing out on large s3 operations.

This is currently implemented as a FileSystem and not an AbstractFileSystem.

  was:
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

It should be largely compatible with s3native except that it won't recognize 
s3native's empty directory marker files (*_$folder$), since it uses folder/ 
like Amazon's S3 explorer to denote empty directories.

Other caveats:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Target Version/s:   (was: 2.4.0)
  Status: Patch Available  (was: Open)

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs
Reporter: Jordan Mendelson

 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which changes around some keys to hopefully bring the 
 key name style more in line with the rest of hadoop 2.x.
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length be known before a file is uploaded, all 
 output is buffered out to a temporary file first similar to the s3native 
 driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify 
 hadoop that progress is still being made for rename operations, so your job 
 may time out unless you increase the task timeout.
 This driver will fully ignore _$folder$ files. This was necessary so that it 
 could interoperate with repositories that have had the s3native driver used 
 on them, but it means that it won't recognize empty directories created by 
 s3native.
 Statistics for the filesystem may be calculated differently than the s3native 
 filesystem. When uploading a file, we do not count writing the temporary file 
 on the local filesystem towards the local filesystem's written bytes count. 
 When renaming files, we do not count the S3-S3 copy as read or write 
 operations. Unlike the s3native driver, we only count bytes written when we 
 start the upload (as opposed to the write calls to the temporary local file). 
 The driver also counts read & write ops, but they are done mostly to keep 
 from timing out on large s3 operations.
 This is currently implemented as a FileSystem and not an AbstractFileSystem.





[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Attachment: HADOOP-10400-1.patch

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs
Reporter: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files (xyz/ instead of 
 xyz_$folder$), which reduces littering
 - Ignores _$folder$ files created by s3native and other S3 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2, which changes around some keys to hopefully bring the 
 key name style more in line with the rest of hadoop 2.x.
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length be known before a file is uploaded, all 
 output is buffered out to a temporary file first similar to the s3native 
 driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 directories may take a while. Unfortunately, there is no way to notify 
 hadoop that progress is still being made for rename operations, so your job 
 may time out unless you increase the task timeout.
 This driver will fully ignore _$folder$ files. This was necessary so that it 
 could interoperate with repositories that have had the s3native driver used 
 on them, but it means that it won't recognize empty directories created by 
 s3native.
 Statistics for the filesystem may be calculated differently than the s3native 
 filesystem. When uploading a file, we do not count writing the temporary file 
 on the local filesystem towards the local filesystem's written bytes count. 
 When renaming files, we do not count the S3-S3 copy as read or write 
 operations. Unlike the s3native driver, we only count bytes written when we 
 start the upload (as opposed to the write calls to the temporary local file). 
 The driver also counts read & write ops, but they are done mostly to keep 
 from timing out on large s3 operations.
 This is currently implemented as a FileSystem and not an AbstractFileSystem.





[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

*Caveats*:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies uploads 
against an md5 that the driver sets on the upload request. While this 
FileSystem should be significantly faster than the built-in s3native driver 
because of parallel copy support, you may want to consider setting a null 
output committer on your jobs to further improve performance.

Because S3 requires the file length and MD5 to be known before a file is 
uploaded, all output is buffered out to a temporary file first similar to the 
s3native driver.

Due to the lack of native rename() for S3, renaming extremely large files or 
directories may take a while. Unfortunately, there is no way to notify hadoop 
that progress is still being made for rename operations, so your job may time 
out unless you increase the task timeout.
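
Raising that timeout is a one-liner; a minimal sketch, assuming 
org.apache.hadoop.conf.Configuration (mapreduce.task.timeout is the standard 
Hadoop 2.x property; the value shown is illustrative):
{code}
// Give copy-based renames more headroom before the framework declares the
// task dead; mapreduce.task.timeout is in milliseconds (default 600000).
Configuration conf = new Configuration();
conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);  // 30 minutes
{code}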

This driver will fully ignore _$folder$ files. This was necessary so that it 
could interoperate with repositories that have had the s3native driver used on 
them, but it means that it won't recognize empty directories created by 
s3native.

Statistics for the filesystem may be calculated differently than the s3native 
filesystem. When uploading a file, we do not count writing the temporary file 
on the local filesystem towards the local filesystem's written bytes count. 
When renaming files, we do not count the S3-S3 copy as read or write 
operations. Unlike the s3native driver, we only count bytes written when we 
start the upload (as opposed to the write calls to the temporary local file). 
The driver also counts read & write ops, but they are done mostly to keep from 
timing out on large s3 operations.

This is currently implemented as a FileSystem and not an AbstractFileSystem.
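
(In practice the distinction shows up in which client API can reach the store; 
a rough sketch, with the bucket name illustrative and conf assumed to be an 
org.apache.hadoop.conf.Configuration:)
{code}
// Works with this patch: s3a registers as a FileSystem implementation.
FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

// Would require an AbstractFileSystem binding, which this patch does not add:
// FileContext fc = FileContext.getFileContext(URI.create("s3a://my-bucket/"), conf);
{code}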

  was:
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

*Caveats*:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies uploads 
against an md5 that the driver sets on the upload request. While this 
FileSystem should be significantly faster than the built-in s3native driver 
because of parallel copy support, you may want 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

*Tunable parameters:*

fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
fs.s3a.connection.maximum - Controls how many parallel connections 
HttpClient spawns (default: 15)
fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
(default: true)
fs.s3a.attempts.maximum - How many times we should retry commands on 
transient errors (default: 10)
fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
doing directory listings (default: 5000)
fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
operation into (default: 104857600)
fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
non-parallel upload (default: 104857600)
fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
(private | public-read | public-read-write | authenticated-read | 
log-delivery-write | bucket-owner-read | bucket-owner-full-control)
fs.s3a.multipart.purge - True if you want to purge existing multipart 
uploads that may not have been completed/aborted correctly (default: false)
fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads to 
purge (default: 86400)
fs.s3a.buffer.dir - Comma separated list of directories that will be used 
to buffer file writes out of (default: uses fs.s3.buffer.dir)
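
To make the wiring concrete, a minimal self-contained sketch of setting a few 
of these keys programmatically and opening the filesystem (the bucket name, 
key values, and directories are illustrative placeholders, not 
recommendations):
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3AConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Omit the key pair entirely to fall back to IAM role authentication.
    conf.set("fs.s3a.access.key", "AKIA...");                // placeholder
    conf.set("fs.s3a.secret.key", "...");                    // placeholder
    conf.setInt("fs.s3a.connection.maximum", 15);
    conf.setLong("fs.s3a.multipart.size", 104857600L);       // 100 MB parts
    conf.set("fs.s3a.buffer.dir", "/data1/s3a,/data2/s3a");  // spread buffer IO
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    System.out.println(fs.getUri());
  }
}
{code}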


*Caveats*:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies uploads 
against an md5 that the driver sets on the upload request. While this 
FileSystem should be significantly faster than the built-in s3native driver 
because of parallel copy support, you may want to consider setting a null 
output committer on your jobs to further improve performance.
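
("Null output committer" here means a committer whose hooks all do nothing, so 
the COPYING/rename pass is skipped entirely; a minimal sketch against the 
mapreduce API, with the class name ours rather than anything in the patch:)
{code}
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Every hook is a no-op: output written straight to S3 stays where it landed,
// and no task output is promoted via rename.
public class NullOutputCommitter extends OutputCommitter {
  @Override public void setupJob(JobContext context) throws IOException { }
  @Override public void setupTask(TaskAttemptContext context) throws IOException { }
  @Override public boolean needsTaskCommit(TaskAttemptContext context) { return false; }
  @Override public void commitTask(TaskAttemptContext context) throws IOException { }
  @Override public void abortTask(TaskAttemptContext context) throws IOException { }
}
{code}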

Because S3 requires the file length and MD5 to be known before a file is 
uploaded, all output is buffered out to a temporary file first similar to the 
s3native driver.

Due to the lack of native rename() for S3, renaming extremely large files or 
directories may take a while. Unfortunately, there is no way to notify hadoop 
that progress is still being made for rename operations, so your job may time 
out unless you increase the task timeout.

This driver will fully ignore _$folder$ files. This was necessary so that it 
could interoperate with repositories that have had the s3native driver used on 
them, but it means that it won't recognize empty directories created by 
s3native.

Statistics for the filesystem may be calculated differently than the s3native 
filesystem. When uploading a file, we do not count writing the temporary file 
on the local filesystem towards the local filesystem's written bytes count. 
When renaming files, we do not count the S3-S3 copy as read or write 
operations. Unlike the s3native driver, we only count bytes written when we 
start the upload (as opposed to the write calls to the temporary local file). 
The driver also counts read & write ops, but they are done mostly to keep from 
timing out on large s3 operations.

This is currently implemented as a FileSystem and not an AbstractFileSystem.

  was:
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

*Tunable parameters:*

fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
fs.s3a.connection.maximum - Controls how many parallel connections 
HttpClient spawns (default: 15)
fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
(default: true)
fs.s3a.attempts.maximum - How many times we should retry commands on 
transient errors (default: 10)
fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
doing directory listings (default: 5000)
fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
operation into (default: 104857600)
fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
non-parallel upload (default: 524288000)
fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
(private | public-read | public-read-write | authenticated-read | 
log-delivery-write | bucket-owner-read | bucket-owner-full-control)
fs.s3a.multipart.purge - True if you want to purge existing multipart 
uploads that may not have been completed/aborted correctly (default: false)
fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads to 
purge (default: 86400)
fs.s3a.buffer.dir - Comma separated list of directories that will be used 
to buffer file writes out of (default: uses fs.s3.buffer.dir)


*Caveats*:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This can cause unnecessary performance issues with S3 
because it does not have a rename operation and S3 already verifies uploads 
against an md5 that the driver sets on the upload request. While this 
FileSystem should be significantly faster than the built-in s3native driver 
because of parallel copy support, you may want to consider setting a null 
output committer on your jobs to further improve performance.

Because S3 requires the file length and MD5 to be known before a file is 
uploaded, all output is buffered out to a temporary file first similar to the 
s3native driver.

Due to the lack of native rename() for S3, renaming extremely large files or 
directories may take a while. Unfortunately, there is no way to notify hadoop 
that progress is still being made for rename operations, so your job may time 
out unless you increase the task timeout.

This driver will fully ignore _$folder$ files. This was necessary so that it 
could interoperate with repositories that have had the s3native driver used on 
them, but it means that it won't recognize empty directories created by 
s3native.

Statistics for the filesystem may be calculated differently than the s3native 
filesystem. When uploading a file, we do not count writing the temporary file 
on the local filesystem towards the local filesystem's written bytes count. 
When renaming files, we do not count the S3-S3 copy as read or write 
operations. Unlike the s3native driver, we only count bytes written when we 
start the upload (as opposed to the write calls to the temporary local file). 
The driver also counts read & write ops, but they are done mostly to keep from 
timing out on large s3 operations.

This is currently implemented as a FileSystem and not an AbstractFileSystem.

  was:
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 

[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
The s3native filesystem has a number of limitations (some of which were 
recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
the aws-sdk instead of the jets3t library. There are a number of improvements 
over s3native including:

- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directory files (xyz/ instead of 
xyz_$folder$), which reduces littering
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for 
splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
various pom files to get it to build against trunk. I've been using 0.0.1 in 
production with CDH 4 for several months and CDH 5 for a few days. The version 
here is 0.0.2, which changes around some keys to hopefully bring the key name 
style more in line with the rest of hadoop 2.x.

*Tunable parameters:*

fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
fs.s3a.connection.maximum - Controls how many parallel connections 
HttpClient spawns (default: 15)
fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
(default: true)
fs.s3a.attempts.maximum - How many times we should retry commands on 
transient errors (default: 10)
fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
fs.s3a.paging.maximum - How many keys to request from S3 at a time when 
doing directory listings (default: 5000)
fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
operation into (default: 104857600)
fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
non-parallel upload (default: 2147483647)
fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
(private | public-read | public-read-write | authenticated-read | 
log-delivery-write | bucket-owner-read | bucket-owner-full-control)
fs.s3a.multipart.purge - True if you want to purge existing multipart 
uploads that may not have been completed/aborted correctly (default: false)
fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads to 
purge (default: 86400)
fs.s3a.buffer.dir - Comma separated list of directories that will be used 
to buffer file writes out of (default: uses fs.s3.buffer.dir)


*Caveats*:

Hadoop uses a standard output committer which uploads files as filename.COPYING 
before renaming them. This causes unnecessary overhead with S3, because S3 has 
no rename operation and already verifies uploads against an MD5 that the 
driver sets on the upload request. While this FileSystem should be 
significantly faster than the built-in s3native driver because of parallel 
copy support, you may want to consider setting a null output committer on 
your jobs to further improve performance.
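
A null output committer is simply one whose hooks all do nothing, so the 
COPYING-then-rename phase never runs. The sketch below shows the shape of 
such a class against the org.apache.hadoop.mapreduce API; it is illustrative 
rather than part of this patch, and how it gets wired into a job (for 
example through an OutputFormat's getOutputCommitter) depends on which 
MapReduce API the job uses.

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class NullOutputCommitter extends OutputCommitter {
      // Every hook is a no-op: nothing is set up, nothing needs committing,
      // and nothing is renamed, so the filename.COPYING phase never happens.
      @Override public void setupJob(JobContext context) throws IOException {}
      @Override public void setupTask(TaskAttemptContext context) throws IOException {}
      @Override public boolean needsTaskCommit(TaskAttemptContext context) { return false; }
      @Override public void commitTask(TaskAttemptContext context) throws IOException {}
      @Override public void abortTask(TaskAttemptContext context) throws IOException {}
    }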

Because S3 requires the file length and MD5 to be known before a file is 
uploaded, all output is buffered to a temporary file first, similar to the 
s3native driver.

Due to the lack of a native rename() in S3, renaming extremely large files or 
directories may take a while. Unfortunately, there is no way to notify Hadoop 
that progress is still being made during rename operations, so your job may 
time out unless you increase the task timeout.
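
For example, the timeout can be raised through the standard task-timeout key 
(mapreduce.task.timeout in Hadoop 2.x, in milliseconds; older releases use 
mapred.task.timeout). The 30-minute value below is only illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class TaskTimeoutExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 30 minutes, in milliseconds; a value of 0 disables the timeout.
        conf.setLong("mapreduce.task.timeout", 30L * 60 * 1000);
        System.out.println(conf.get("mapreduce.task.timeout"));
      }
    }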

This driver fully ignores _$folder$ files. This was necessary so that it can 
interoperate with repositories where the s3native driver has been used, but it 
means the driver won't recognize empty directories that s3native created.

Statistics for the filesystem may be calculated differently from the s3native 
filesystem. When uploading a file, we do not count writing the temporary file 
on the local filesystem towards the local filesystem's written-bytes count. 
When renaming files, we do not count the S3-to-S3 copy as read or write 
operations. Unlike the s3native driver, we only count bytes written when we 
start the upload (as opposed to the write calls to the temporary local file). 
The driver also counts read and write ops, though these are reported mostly 
to keep large S3 operations from timing out.
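
To inspect these counters directly, the per-scheme FileSystem.Statistics 
objects expose them. A minimal sketch (not part of the patch) that prints 
the s3a counters after some filesystem activity:

    import org.apache.hadoop.fs.FileSystem;

    public class S3AStatsExample {
      public static void main(String[] args) {
        // Walk the per-scheme statistics registered by the FileSystem layer
        // and print the s3a counters discussed above.
        for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
          if ("s3a".equals(stats.getScheme())) {
            System.out.println("bytes read:    " + stats.getBytesRead());
            System.out.println("bytes written: " + stats.getBytesWritten());
            System.out.println("read ops:      " + stats.getReadOps());
            System.out.println("write ops:     " + stats.getWriteOps());
          }
        }
      }
    }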

The AWS SDK unfortunately passes the multipart threshold as an int, which means 
fs.s3a.multipart.threshold cannot be greater than 2^31-1 (2147483647).
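
In practice this means any configured value above Integer.MAX_VALUE has to 
be capped before it reaches the SDK. A defensive clamp such as the 
illustrative sketch below shows the arithmetic:

    import org.apache.hadoop.conf.Configuration;

    public class MultipartThresholdClamp {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        long configured =
            conf.getLong("fs.s3a.multipart.threshold", Integer.MAX_VALUE);
        // The SDK parameter is an int, so anything above 2^31-1 must be
        // capped before it is handed over.
        int effective = (int) Math.min(configured, Integer.MAX_VALUE);
        System.out.println("effective multipart threshold: " + effective);
      }
    }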

This is currently implemented as a FileSystem and not an AbstractFileSystem.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Attachment: HADOOP-10400-2.patch

Updates core-default.xml so it properly matches the defaults in the s3a 
driver.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Attachment: HADOOP-10400-3.patch

HADOOP-10400-3 should take care of the linter problems.


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Description: 
(The edited description repeats the previous revision verbatim, with one 
change: the fs.s3a.buffer.dir default is now ${hadoop.tmp.dir}/s3a rather 
than fs.s3.buffer.dir.)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Status: Open  (was: Patch Available)


[jira] [Updated] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-03-10 Thread Jordan Mendelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Mendelson updated HADOOP-10400:
--

Status: Patch Available  (was: Open)
