[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165391#comment-16165391
 ] 

Jacob Marble edited comment on BEAM-2500 at 9/14/17 4:57 AM:
-------------------------------------------------------------

I'm interested in implementing S3 support. I'm not familiar with Beam 
internals and am not committing myself to anything yet, but perhaps someone 
can comment on my research notes.

GCS is probably a good template. Implement FileSystem, ResourceId, 
FileSystemRegistrar, PipelineOptions, PipelineOptionsRegistrar:
https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

For interacting with S3, this is probably the preferred SDK:
http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

Some specifics about implementing FileSystem:

FileSystem.copy()
- AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String 
destinationBucketName, String destinationKey)
- a single copyObject call handles objects up to 5GB, which is probably fine 
to start, but a multipart copy is needed to reach the full 5TB limit
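
A rough sketch of the planning step a multipart copy would need: splitting an 
object into byte ranges, one per part. The class and method names are made up 
for illustration, and the limits (5GB read as GiB, at most 10,000 parts) are 
assumptions that should be checked against the current S3 documentation. No 
AWS SDK calls are made here; each range would be handed to whatever 
part-copy request the SDK provides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an object into inclusive byte ranges, one per part. */
final class CopyPartPlanner {
  // Assumed limits; confirm against the S3 documentation before relying on them.
  static final long MAX_SINGLE_COPY_BYTES = 5L * 1024 * 1024 * 1024;
  static final int MAX_PARTS = 10_000;

  /** True when the object is too large for a single copyObject call. */
  static boolean needsMultipart(long objectSize) {
    return objectSize > MAX_SINGLE_COPY_BYTES;
  }

  /** Returns {firstByte, lastByte} pairs that together cover [0, objectSize). */
  static List<long[]> planParts(long objectSize, long partSize) {
    if (objectSize <= 0 || partSize <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    long partCount = (objectSize + partSize - 1) / partSize; // ceiling division
    if (partCount > MAX_PARTS) {
      throw new IllegalArgumentException("part size too small: " + partCount + " parts");
    }
    List<long[]> parts = new ArrayList<>();
    for (long start = 0; start < objectSize; start += partSize) {
      long end = Math.min(start + partSize, objectSize) - 1;
      parts.add(new long[] {start, end});
    }
    return parts;
  }
}
```

Objects at or below the single-copy limit could keep using copyObject 
directly, so the planner would only come into play for larger objects.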

FileSystem.create()
- AmazonS3Client.putObject(String bucketName, String key, InputStream 
input, ObjectMetadata metadata)
- max upload size is 5GB, which is probably fine to start, but need to use 
multipart upload to get full 5TB limit
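
FileSystem.create() hands back a WritableByteChannel, so one possible shape 
for the multipart case is a channel that buffers writes and emits fixed-size 
parts. The sketch below uses only the JDK; PartBufferingChannel and its 
partSink callback are hypothetical names, and the callback stands in for 
whatever would actually upload a part.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.WritableByteChannel;
import java.util.function.Consumer;

/** Hypothetical channel: buffers writes and emits fixed-size parts. */
final class PartBufferingChannel implements WritableByteChannel {
  private final int partSize;
  private final Consumer<byte[]> partSink; // stand-in for "upload one part"
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private boolean open = true;

  PartBufferingChannel(int partSize, Consumer<byte[]> partSink) {
    if (partSize <= 0) {
      throw new IllegalArgumentException("partSize must be positive");
    }
    this.partSize = partSize;
    this.partSink = partSink;
  }

  @Override
  public int write(ByteBuffer src) throws IOException {
    if (!open) {
      throw new ClosedChannelException();
    }
    int written = 0;
    while (src.hasRemaining()) {
      // Take only as many bytes as still fit into the current part.
      int n = Math.min(src.remaining(), partSize - buffer.size());
      byte[] chunk = new byte[n];
      src.get(chunk);
      buffer.write(chunk, 0, n);
      written += n;
      if (buffer.size() == partSize) {
        emitPart();
      }
    }
    return written;
  }

  private void emitPart() {
    partSink.accept(buffer.toByteArray());
    buffer.reset();
  }

  @Override
  public boolean isOpen() {
    return open;
  }

  /** Emits the final, possibly shorter, part and closes the channel. */
  @Override
  public void close() {
    if (!open) {
      return;
    }
    if (buffer.size() > 0) {
      emitPart();
    }
    open = false;
  }
}
```

A real implementation would also have to start and complete the multipart 
upload and abort it on failure; none of that is shown here.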

FileSystem.delete()
- AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)
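
One detail worth planning for: as far as I know, a multi-object delete 
request accepts at most 1000 keys and targets a single bucket, so a long 
list of resources would have to be grouped by bucket and then split into 
batches. A minimal sketch of the batching step, with a made-up DeleteBatcher 
class and the 1000-key limit treated as an assumption:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a key list into request-sized batches. */
final class DeleteBatcher {
  // Assumed per-request limit for multi-object delete; verify before use.
  static final int MAX_KEYS_PER_REQUEST = 1000;

  static <T> List<List<T>> batches(List<T> keys, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> result = new ArrayList<>();
    for (int i = 0; i < keys.size(); i += batchSize) {
      int end = Math.min(i + batchSize, keys.size());
      // Copy the sublist so each batch is independent of the input list.
      result.add(new ArrayList<>(keys.subList(i, end)));
    }
    return result;
  }
}
```

Each batch would then become one DeleteObjectsRequest for its bucket.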

FileSystem.getScheme()
- return "s3"

FileSystem.match()
- j.o.apache.beam.sdk.extensions.util.gcsfs.GcsPath and same.GcsUtil have some 
good ideas
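
The general idea I'd expect to borrow for match() is: take the part of the 
glob before the first wildcard as a listing prefix, list keys under it, and 
filter them with a regex built from the glob. The sketch below is my own 
simplified version, not a copy of the GCS code; GlobSketch is a hypothetical 
name, only '*' and '?' are handled, and treating '*' as "anything except '/'" 
is an assumption about the desired semantics.

```java
import java.util.regex.Pattern;

/** Hypothetical glob helper supporting only '*' and '?'. */
final class GlobSketch {
  /** Longest leading part of the glob with no wildcard; usable as a list prefix. */
  static String literalPrefix(String glob) {
    int i = 0;
    while (i < glob.length() && glob.charAt(i) != '*' && glob.charAt(i) != '?') {
      i++;
    }
    return glob.substring(0, i);
  }

  /** Translates the glob into a regex; non-wildcard text is matched literally. */
  static Pattern toRegex(String glob) {
    StringBuilder regex = new StringBuilder();
    StringBuilder literal = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*' || c == '?') {
        appendQuoted(regex, literal);
        regex.append(c == '*' ? "[^/]*" : "[^/]");
      } else {
        literal.append(c);
      }
    }
    appendQuoted(regex, literal);
    return Pattern.compile(regex.toString());
  }

  /** Quotes the pending literal run so regex metacharacters in it stay literal. */
  private static void appendQuoted(StringBuilder regex, StringBuilder literal) {
    if (literal.length() > 0) {
      regex.append(Pattern.quote(literal.toString()));
      literal.setLength(0);
    }
  }
}
```

For example, the glob "logs/2017-*/part-?.txt" would list keys under the 
prefix "logs/2017-" and keep only those the compiled pattern fully matches.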

FileSystem.matchNewResource()
- Look at GcsPath and GcsUtil
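
matchNewResource() has to turn a string such as "s3://bucket/key" into a 
ResourceId, so some bucket/key parsing is needed whatever the final design 
looks like. A minimal sketch, with S3Location as a hypothetical name and the 
"trailing slash means directory" rule as an assumption:

```java
/** Hypothetical value class holding the bucket and key of an s3:// spec. */
final class S3Location {
  private static final String PREFIX = "s3://";

  final String bucket;
  final String key;

  private S3Location(String bucket, String key) {
    this.bucket = bucket;
    this.key = key;
  }

  /** Parses "s3://bucket/key"; the key may be empty or contain further slashes. */
  static S3Location parse(String spec) {
    if (!spec.startsWith(PREFIX)) {
      throw new IllegalArgumentException("expected an s3:// spec: " + spec);
    }
    String rest = spec.substring(PREFIX.length());
    int slash = rest.indexOf('/');
    String bucket = slash < 0 ? rest : rest.substring(0, slash);
    String key = slash < 0 ? "" : rest.substring(slash + 1);
    if (bucket.isEmpty()) {
      throw new IllegalArgumentException("missing bucket: " + spec);
    }
    return new S3Location(bucket, key);
  }

  /** Assumed convention: an empty key or a trailing slash denotes a directory. */
  boolean isDirectory() {
    return key.isEmpty() || key.endsWith("/");
  }

  @Override
  public String toString() {
    return PREFIX + bucket + "/" + key;
  }
}
```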

FileSystem.open()
- AmazonS3Client.getObject(String bucketName, String key)
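
Since FileSystem.open() returns a ReadableByteChannel and getObject() 
ultimately yields an InputStream for the object content, the simplest bridge 
is probably java.nio.channels.Channels. The sketch below shows only that 
bridging step on a plain InputStream; OpenSketch is a hypothetical name and 
no S3 call is made.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

/** Hypothetical helper showing the InputStream-to-channel bridge. */
final class OpenSketch {
  /** Wraps the object content stream in a channel, as open() would return it. */
  static ReadableByteChannel asChannel(InputStream objectContent) {
    return Channels.newChannel(objectContent);
  }

  /** Drains a channel into a byte array; only here to exercise the channel. */
  static byte[] readFully(ReadableByteChannel channel) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteBuffer buf = ByteBuffer.allocate(8192);
    try {
      while (channel.read(buf) != -1) {
        buf.flip();
        out.write(buf.array(), 0, buf.limit());
        buf.clear();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }
}
```

This gives sequential reads only. If seeking turns out to be needed, a 
different approach (for example ranged reads) would have to replace it.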

FileSystem.rename()
- Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then 
FileSystem.delete()
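
The copy-then-delete idea can be sketched against a tiny stand-in store. 
ObjectStore, InMemoryStore and RenameSketch are all hypothetical names used 
only to show the ordering; the real version would call FileSystem.copy() and 
FileSystem.delete().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal object store, standing in for the S3 client. */
interface ObjectStore {
  void copy(String srcKey, String dstKey);

  void delete(String key);
}

/** In-memory implementation so the sketch can run without S3. */
final class InMemoryStore implements ObjectStore {
  final Map<String, String> objects = new HashMap<>();

  @Override
  public void copy(String srcKey, String dstKey) {
    String value = objects.get(srcKey);
    if (value == null) {
      throw new IllegalArgumentException("no such key: " + srcKey);
    }
    objects.put(dstKey, value);
  }

  @Override
  public void delete(String key) {
    objects.remove(key);
  }
}

final class RenameSketch {
  /** Copies first; the source is deleted only if the copy did not throw. */
  static void rename(ObjectStore store, String srcKey, String dstKey) {
    if (srcKey.equals(dstKey)) {
      return; // nothing to do, and deleting would lose the object
    }
    store.copy(srcKey, dstKey);
    store.delete(srcKey);
  }
}
```

Note that this is not atomic: a failure between the two steps leaves both 
the source and the destination in place.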

I'm not clear about how to register the s3 FileSystem as mentioned in the 
FileSystemRegistrar Javadoc:

"FileSystem creators have the ability to provide a registrar by creating a 
ServiceLoader entry and a concrete implementation of this interface.

It is optional but recommended to use one of the many build time tools such as 
AutoService to generate the necessary META-INF files automatically."
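
My current understanding of the mechanism, sketched with stand-in types: 
java.util.ServiceLoader reads a text file under META-INF/services/ named 
after the fully qualified interface, with one implementation class name per 
line, and instantiates each listed class. AutoService would generate that 
file from an annotation on the implementation. SketchFileSystem and 
SketchRegistrar below are hypothetical placeholders for the Beam interfaces, 
so this shows the pattern and is not Beam code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical stand-in for Beam's FileSystem; only the scheme matters here. */
interface SketchFileSystem {
  String getScheme();
}

/** Hypothetical stand-in for FileSystemRegistrar. */
interface SketchRegistrar {
  Iterable<SketchFileSystem> fileSystems();
}

/**
 * The concrete implementation that the META-INF/services entry would name.
 * A real provider class should be public with a public no-argument constructor.
 */
final class S3SketchRegistrar implements SketchRegistrar {
  @Override
  public Iterable<SketchFileSystem> fileSystems() {
    SketchFileSystem s3 = () -> "s3";
    return Collections.singletonList(s3);
  }
}

final class RegistrarDiscovery {
  /** Indexes every file system offered by the given registrars by its scheme. */
  static Map<String, SketchFileSystem> byScheme(Iterable<SketchRegistrar> registrars) {
    Map<String, SketchFileSystem> result = new HashMap<>();
    for (SketchRegistrar registrar : registrars) {
      for (SketchFileSystem fs : registrar.fileSystems()) {
        result.put(fs.getScheme(), fs);
      }
    }
    return result;
  }

  /** Finds registrars listed in META-INF/services/ entries on the classpath. */
  static Map<String, SketchFileSystem> discover() {
    return byScheme(ServiceLoader.load(SketchRegistrar.class));
  }
}
```

Without a matching META-INF/services file, discover() simply finds nothing, 
which is presumably why a build-time generator is recommended.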


was (Author: jmarble):
[Same text as the edited comment above, except that the list of interfaces 
to implement also included PathValidator.]

> Add support for S3 as a Apache Beam FileSystem
> ----------------------------------------------
>
>                 Key: BEAM-2500
>                 URL: https://issues.apache.org/jira/browse/BEAM-2500
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Luke Cwik
>            Priority: Minor
>         Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
