[jira] [Commented] (BEAM-2500) Add support for S3 as a Apache Beam FileSystem

Steve Loughran (JIRA) Fri, 15 Sep 2017 02:31:32 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167608#comment-16167608
 ]


Steve Loughran commented on BEAM-2500:
--------------------------------------

This is how Hadoop does its multipart upload:

An OutputStream which switches to a multipart upload (MPU) once the amount of 
data exceeds the block size:

https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java
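
A minimal sketch of that switching behaviour, under stated assumptions: the 
BlockSink interface and BlockSwitchingOutputStream class are hypothetical names 
invented for illustration, not the S3A or AWS SDK API, and the buffer here is 
always on the heap. It only shows the idea of a single PUT for small objects 
and a multipart upload once the data exceeds one block.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical sink standing in for the store client; not an AWS or Hadoop API. */
interface BlockSink {
    void putObject(byte[] data) throws IOException;                   // single PUT
    void uploadPart(int partNumber, byte[] data) throws IOException;  // one MPU part
    void completeMultipart(int partCount) throws IOException;         // commit the MPU
}

/** Buffers one block; switches to multipart upload only when data exceeds it. */
class BlockSwitchingOutputStream extends OutputStream {
    private final BlockSink sink;
    private final int blockSize;
    private ByteArrayOutputStream block = new ByteArrayOutputStream();
    private int partsUploaded = 0;
    private boolean closed = false;

    BlockSwitchingOutputStream(BlockSink sink, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.sink = sink;
        this.blockSize = blockSize;
    }

    @Override
    public void write(int b) throws IOException {
        if (closed) {
            throw new IOException("stream closed");
        }
        if (block.size() == blockSize) {
            // The current block is full and more data is arriving:
            // start (or continue) the multipart upload with this block.
            flushBlockAsPart();
        }
        block.write(b);
    }

    private void flushBlockAsPart() throws IOException {
        partsUploaded++;
        sink.uploadPart(partsUploaded, block.toByteArray());
        block = new ByteArrayOutputStream();
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        if (partsUploaded == 0) {
            // Everything fitted in one block: a plain PUT is enough.
            sink.putObject(block.toByteArray());
        } else {
            if (block.size() > 0) {
                flushBlockAsPart();  // the final, possibly short, part
            }
            sink.completeMultipart(partsUploaded);
        }
    }
}
```

A real implementation would upload parts asynchronously and abort the MPU on 
failure; both are left out of this sketch.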

Option to use heap, ByteBuffer or an HDD pool for storage:
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ADataBlocks.java

The default is HDD because the other two are fussier about thread 
configuration: a mismatch between data generation rate and upload bandwidth 
will cause OOM failures, which invariably happen halfway through a distcp.
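
To illustrate the failure mode: if blocks are buffered in memory and queued 
faster than they can be uploaded, the queue grows until the heap runs out. One 
generic way to avoid that is to cap the number of blocks in flight so that the 
writer blocks instead. The sketch below is an assumption-laden illustration, 
not the S3A code; BoundedBlockUploader and its parameters are hypothetical 
names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical helper: caps buffered blocks so a fast writer cannot exhaust memory. */
class BoundedBlockUploader implements AutoCloseable {
    private final Semaphore permits;
    private final ExecutorService pool;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxObserved = new AtomicInteger();

    BoundedBlockUploader(int maxActiveBlocks, int uploadThreads) {
        this.permits = new Semaphore(maxActiveBlocks);
        this.pool = Executors.newFixedThreadPool(uploadThreads);
    }

    /** Blocks the caller while maxActiveBlocks blocks are already queued or uploading. */
    void submit(byte[] block, Consumer<byte[]> upload) throws InterruptedException {
        permits.acquire();  // back-pressure on the producer
        int now = inFlight.incrementAndGet();
        maxObserved.accumulateAndGet(now, Math::max);
        pool.execute(() -> {
            try {
                upload.accept(block);
            } finally {
                // Decrement before releasing so inFlight never exceeds the cap.
                inFlight.decrementAndGet();
                permits.release();
            }
        });
    }

    /** Highest number of blocks that were held at once; for inspection only. */
    int maxObservedInFlight() {
        return maxObserved.get();
    }

    @Override
    public void close() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

With a cap of N blocks of size B, the memory held by pending uploads stays at 
roughly N * B regardless of how far the writer outpaces the network.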

If you want to evolve the Hadoop FS APIs for better blobstore integration, 
that's something to play with (HADOOP-9565 has discussed it for ages). Issue: 
there is a broad set of differences between the stores, and the lowest common 
denominator is too limited. Justification: making things look like a 
directory tree with operations like rename() is even worse, and there is no 
copy() in the API at present.

I'd go for the core set of verbs: PUT, LIST, COPY, plus the ability to query 
the FS for its semantics (consistency model, etc.). A cross-store multipart 
upload would be trickier.
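
As a rough sketch of what such a core-verb API might look like: every name 
below (BlobStore, Consistency, InMemoryBlobStore) is hypothetical and exists in 
neither Hadoop nor Beam. The in-memory implementation is only there to make the 
example runnable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Hypothetical blobstore API: core verbs plus a way to query semantics. */
interface BlobStore {
    enum Consistency { STRONG, EVENTUAL }

    void put(String key, byte[] data);          // PUT
    List<String> list(String prefix);           // LIST, flat keys, no directories
    void copy(String srcKey, String destKey);   // COPY, done by the store itself
    Consistency listConsistency();              // let callers ask, not guess
}

/** Toy in-memory store used only to exercise the interface. */
class InMemoryBlobStore implements BlobStore {
    private final TreeMap<String, byte[]> objects = new TreeMap<>();

    @Override
    public void put(String key, byte[] data) {
        objects.put(key, data.clone());
    }

    @Override
    public List<String> list(String prefix) {
        List<String> keys = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order.
        for (String key : objects.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            keys.add(key);
        }
        return keys;
    }

    @Override
    public void copy(String srcKey, String destKey) {
        byte[] data = objects.get(srcKey);
        if (data == null) {
            throw new NoSuchElementException(srcKey);
        }
        objects.put(destKey, data.clone());
    }

    @Override
    public Consistency listConsistency() {
        return Consistency.STRONG;
    }
}
```

Note there is deliberately no rename() and no directory concept: a caller that 
wants rename semantics has to build them from COPY, and can check 
listConsistency() before trusting a LIST.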

> Add support for S3 as a Apache Beam FileSystem
> ----------------------------------------------
>
>                 Key: BEAM-2500
>                 URL: https://issues.apache.org/jira/browse/BEAM-2500
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Luke Cwik
>            Priority: Minor
>         Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
