[ 
https://issues.apache.org/jira/browse/BEAM-12435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355184#comment-17355184
 ] 

Matt Rudary commented on BEAM-12435:
------------------------------------

My proposed design is to do the following (for both aws and aws2 packages):

1. Add a public class, S3FileSystemConfiguration, that mostly mirrors the 
fields of S3Options and adds a scheme field.

2. Add a public interface, S3FileSystemSchemeRegistrar, designed for use with 
AutoService. It will have a single method that takes a PipelineOptions and 
returns an Iterable of S3FileSystemConfiguration. This is how users will 
register their S3 URI schemes with the system.

3. Add an implementation of S3FileSystemSchemeRegistrar for the s3 scheme that 
uses the S3Options from PipelineOptions to populate its 
S3FileSystemConfiguration, maintaining the current behavior by default.

4. Modify S3FileSystem's constructor to take an S3FileSystemConfiguration 
object instead of an S3Options, and make the relevant changes.

5. Modify S3FileSystemRegistrar to load all the AutoService'd file system 
configurations, raising an exception if multiple scheme registrars attempt to 
register the same scheme.
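
As a rough illustration of steps 1, 2, 3 and 5, here is a minimal, 
self-contained sketch. It is not the proposed patch: OptionsStandIn is a 
hypothetical placeholder for PipelineOptions/S3Options so the example compiles 
without Beam, the region and endpoint fields, the fromOptions method name, the 
VendorSchemeRegistrar class and the SchemeRegistrySketch helper are all 
assumptions, and the registrars are passed in explicitly rather than 
discovered through AutoService.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for PipelineOptions/S3Options (not a Beam class).
final class OptionsStandIn {
    final String region;
    OptionsStandIn(String region) { this.region = region; }
}

// Step 1: S3Options-like settings plus a scheme field (fields are assumed).
final class S3FileSystemConfiguration {
    final String scheme;
    final String region;
    final String endpoint; // null means "use the default endpoint"
    S3FileSystemConfiguration(String scheme, String region, String endpoint) {
        this.scheme = scheme;
        this.region = region;
        this.endpoint = endpoint;
    }
}

// Step 2: the registrar interface; the method name is an assumption.
interface S3FileSystemSchemeRegistrar {
    Iterable<S3FileSystemConfiguration> fromOptions(OptionsStandIn options);
}

// Step 3: default registrar that keeps the current "s3" behavior.
final class DefaultS3FileSystemSchemeRegistrar implements S3FileSystemSchemeRegistrar {
    @Override
    public Iterable<S3FileSystemConfiguration> fromOptions(OptionsStandIn options) {
        return Collections.singletonList(
            new S3FileSystemConfiguration("s3", options.region, null));
    }
}

// A user-supplied registrar for a hypothetical S3-compatible vendor.
final class VendorSchemeRegistrar implements S3FileSystemSchemeRegistrar {
    @Override
    public Iterable<S3FileSystemConfiguration> fromOptions(OptionsStandIn options) {
        return Collections.singletonList(
            new S3FileSystemConfiguration(
                "vendor", options.region, "https://storage.example.invalid"));
    }
}

// Step 5: gather all configurations and reject duplicate schemes.
final class SchemeRegistrySketch {
    static Map<String, S3FileSystemConfiguration> collect(
            Iterable<? extends S3FileSystemSchemeRegistrar> registrars,
            OptionsStandIn options) {
        Map<String, S3FileSystemConfiguration> byScheme = new LinkedHashMap<>();
        for (S3FileSystemSchemeRegistrar registrar : registrars) {
            for (S3FileSystemConfiguration config : registrar.fromOptions(options)) {
                // putIfAbsent returns the earlier value if the scheme is taken.
                if (byScheme.putIfAbsent(config.scheme, config) != null) {
                    throw new IllegalStateException(
                        "Multiple registrars registered scheme: " + config.scheme);
                }
            }
        }
        return byScheme;
    }
}
```

In the real change, the registrars would presumably be discovered from the 
classpath (AutoService generates the META-INF/services entries that 
java.util.ServiceLoader reads), and each resulting configuration would be 
handed to the S3FileSystem constructor described in step 4.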

I considered alternative methods of configuration, in particular using a 
configuration file as in HadoopFileSystemOptions. In the end, I decided that 
the AutoService approach was better. First, it seems to me to be the more 
common way of doing things within Beam. Second, unlike with Hadoop, there is 
no commonly used configuration format for these types of file systems, and it's 
not clear which format would be best (YAML? JSON? Java Properties? XML?). 
Finally, I think the story for composing multiple registrars is better than the 
story for composing multiple configuration files; composition matters, for 
example, when you are dealing with multiple storage vendors.

> Generalize S3FileSystem
> -----------------------
>
>                 Key: BEAM-12435
>                 URL: https://issues.apache.org/jira/browse/BEAM-12435
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-aws
>            Reporter: Matt Rudary
>            Priority: P2
>              Labels: aws, aws-s3
>
> I'm working with multiple storage systems that speak the S3 API. I would like 
> to support FileIO operations for these storage systems, but S3FileSystem 
> hardcodes the s3 scheme (the various systems use different URI schemes) and 
> it is in any case impossible to instantiate more than one in the current 
> design.
> I'd like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe 
> ...aws.options) somewhat to enable this use-case. I haven't worked out the 
> details yet, but it will take some thought to make this work in a non-hacky 
> way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
