[ https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166678#comment-16166678 ]

Chamikara Jayalath commented on BEAM-2500:
------------------------------------------

Thanks for looking into this Jacob. I'll try to answer some of your questions.

I agree that GCS is a good template. It's pretty well battle-tested and is used 
by all Dataflow pipelines.

The 5 GB limit for FileSystem.copy(), though it might be a good start, might 
not be enough for production use. Beam can parallelize reading of a large file, 
so there may be many users who are used to reading a small number of large 
files in their pipelines. Note that we use the copy() operation to finalize 
files written using FileBasedSink implementations, so we might need to copy 
large files there as well. It would be good if copy() could be implemented 
using multipart upload, as you mentioned.
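
To make that concrete, here's a rough sketch of what a multipart-based copy() 
could look like, assuming the AWS SDK for Java v1 (AmazonS3 client); the class 
name and part size are illustrative only, not a proposed implementation:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CopyPartRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import java.util.ArrayList;
import java.util.List;

/** Illustrative multipart copy for objects above the 5 GB single-request limit. */
class S3MultipartCopy {
  private static final long PART_SIZE = 64L * 1024 * 1024; // 64 MB parts (tunable)

  static void copy(AmazonS3 s3, String srcBucket, String srcKey,
                   String dstBucket, String dstKey, long objectSize) {
    // Start a multipart upload on the destination object.
    InitiateMultipartUploadResult init =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(dstBucket, dstKey));
    List<PartETag> partETags = new ArrayList<>();
    long position = 0;
    // Copy the source in byte-range parts; S3 copies server-side, no data is downloaded.
    for (int partNumber = 1; position < objectSize; partNumber++) {
      long lastByte = Math.min(position + PART_SIZE, objectSize) - 1;
      CopyPartRequest part = new CopyPartRequest()
          .withSourceBucketName(srcBucket).withSourceKey(srcKey)
          .withDestinationBucketName(dstBucket).withDestinationKey(dstKey)
          .withUploadId(init.getUploadId())
          .withFirstByte(position).withLastByte(lastByte)
          .withPartNumber(partNumber);
      partETags.add(s3.copyPart(part).getPartETag());
      position = lastByte + 1;
    }
    // Stitch the parts together into the final destination object.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
        dstBucket, dstKey, init.getUploadId(), partETags));
  }
}
{code}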

FileSystem.create() is supposed to return an empty WritableByteChannel that 
will be written to later, so we'll need a way to stream bytes into S3 (some 
implementation of WritableByteChannel). I'm not sure whether the S3 client 
library already supports this.
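
If it doesn't, one option is to buffer writes into fixed-size parts and push 
each part through the SDK's multipart upload calls. A minimal sketch, assuming 
the AWS SDK for Java v1 (class name and buffer size are assumptions):

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.List;

/** Sketch: buffers writes into parts and streams them to S3 via multipart upload. */
class S3WritableByteChannel implements WritableByteChannel {
  private static final int PART_SIZE = 5 * 1024 * 1024; // S3's minimum part size is 5 MB
  private final AmazonS3 s3;
  private final String bucket;
  private final String key;
  private final String uploadId;
  private final List<PartETag> partETags = new ArrayList<>();
  private final ByteBuffer buffer = ByteBuffer.allocate(PART_SIZE);
  private int partNumber = 1;
  private boolean open = true;

  S3WritableByteChannel(AmazonS3 s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
    this.uploadId =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
  }

  @Override
  public int write(ByteBuffer src) throws IOException {
    int written = 0;
    while (src.hasRemaining()) {
      if (!buffer.hasRemaining()) {
        flushPart(); // buffer is full; upload it as the next part
      }
      int n = Math.min(src.remaining(), buffer.remaining());
      byte[] chunk = new byte[n];
      src.get(chunk);
      buffer.put(chunk);
      written += n;
    }
    return written;
  }

  private void flushPart() {
    buffer.flip();
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    UploadPartRequest request = new UploadPartRequest()
        .withBucketName(bucket).withKey(key).withUploadId(uploadId)
        .withPartNumber(partNumber++)
        .withPartSize(bytes.length)
        .withInputStream(new ByteArrayInputStream(bytes));
    partETags.add(s3.uploadPart(request).getPartETag());
    buffer.clear();
  }

  @Override
  public boolean isOpen() {
    return open;
  }

  @Override
  public void close() {
    if (buffer.position() > 0) {
      flushPart(); // the last part may be smaller than 5 MB
    }
    // Note: a completely empty file would need special handling, since
    // completeMultipartUpload requires at least one part.
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
    open = false;
  }
}
{code}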

's3' for the scheme sounds good to me. 

For efficient parallelized reading, FileSystem.open() should return an 
efficiently seekable SeekableByteChannel.
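
A sketch of such a channel, assuming each read() is served by a ranged GET 
(GetObjectRequest.withRange) so that seeking is just bookkeeping; read-only, 
since S3 objects are immutable (class name is illustrative):

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;

/** Sketch: read-only SeekableByteChannel that serves each read with a ranged S3 GET. */
class S3SeekableByteChannel implements SeekableByteChannel {
  private final AmazonS3 s3;
  private final String bucket;
  private final String key;
  private final long size;
  private long position = 0;
  private boolean open = true;

  S3SeekableByteChannel(AmazonS3 s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
    this.size = s3.getObjectMetadata(bucket, key).getContentLength();
  }

  @Override
  public int read(ByteBuffer dst) throws IOException {
    if (position >= size) {
      return -1; // end of object
    }
    if (!dst.hasRemaining()) {
      return 0;
    }
    // Only fetch the byte range the caller asked for.
    long end = Math.min(position + dst.remaining(), size) - 1;
    GetObjectRequest request = new GetObjectRequest(bucket, key).withRange(position, end);
    try (S3Object object = s3.getObject(request)) {
      int total = 0;
      byte[] chunk = new byte[8192];
      int n;
      while ((n = object.getObjectContent().read(chunk)) != -1) {
        dst.put(chunk, 0, n);
        total += n;
      }
      position += total;
      return total;
    }
  }

  @Override
  public long position() { return position; }

  @Override
  public SeekableByteChannel position(long newPosition) {
    position = newPosition; // seek is cheap; the next read issues a new ranged GET
    return this;
  }

  @Override
  public long size() { return size; }

  @Override
  public int write(ByteBuffer src) { throw new NonWritableChannelException(); }

  @Override
  public SeekableByteChannel truncate(long newSize) { throw new NonWritableChannelException(); }

  @Override
  public boolean isOpen() { return open; }

  @Override
  public void close() { open = false; }
}
{code}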

Using the copy() + delete() combination for rename() is fine.
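
Something along these lines, reusing the AmazonS3 client and the hypothetical 
S3MultipartCopy sketched above:

{code:java}
// Sketch: rename as copy-then-delete, since S3 has no native rename.
static void rename(AmazonS3 s3, String bucket, String srcKey, String dstKey, long objectSize) {
  S3MultipartCopy.copy(s3, bucket, srcKey, bucket, dstKey, objectSize);
  s3.deleteObject(bucket, srcKey);
}
{code}

Note this isn't atomic: a failure between the two calls can leave both objects 
in place.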

We might have to address issues related to providing credentials for accessing 
S3. See the following JIRA, where some details related to this were discussed 
when [~demeshchuk] looked into adding an S3 file system for the Python SDK: 
https://issues.apache.org/jira/browse/BEAM-2572
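
One possible shape for that, assuming a hypothetical S3Options interface built 
on Beam's PipelineOptions (names are illustrative; unset values would fall back 
to the AWS default credentials chain):

{code:java}
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

/** Hypothetical options for passing AWS credentials/region to an S3 FileSystem registrar. */
public interface S3Options extends PipelineOptions {
  @Description("AWS region the S3 buckets live in, e.g. us-east-1.")
  String getAwsRegion();
  void setAwsRegion(String value);

  @Description("AWS access key id; if unset, fall back to the default AWS credentials chain.")
  String getAwsAccessKeyId();
  void setAwsAccessKeyId(String value);

  @Description("AWS secret access key; if unset, fall back to the default AWS credentials chain.")
  String getAwsSecretAccessKey();
  void setAwsSecretAccessKey(String value);
}
{code}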

> Add support for S3 as a Apache Beam FileSystem
> ----------------------------------------------
>
>                 Key: BEAM-2500
>                 URL: https://issues.apache.org/jira/browse/BEAM-2500
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Luke Cwik
>            Priority: Minor
>         Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



