RE: Proposal: Generalize S3FileSystem

2021-06-01 Thread Matt Rudary
I've filed https://issues.apache.org/jira/browse/BEAM-12435 to track this 
improvement.

From: Matt Rudary 
Sent: Monday, May 24, 2021 4:49 PM
To: dev@beam.apache.org
Subject: Re: Proposal: Generalize S3FileSystem


Thanks for the comments all. I forgot to subscribe to dev before I sent out the 
email, so this response isn't threaded properly.



My proposed design is to do the following (for both aws and aws2 packages):

1.   Add a public class, S3FileSystemConfiguration, that mostly maps to the 
S3Options, plus a Scheme field.

2.   Add a public interface, S3FileSystemSchemeRegistrar, designed for use 
with AutoService. It will have a method that takes a PipelineOptions and 
returns an Iterable of S3FileSystemConfiguration. This will be the way that 
users register their S3 URI schemes with the system.

3.   Add an implementation of S3FileSystemSchemeRegistrar for the s3 scheme 
that uses the S3Options from PipelineOptions to populate its 
S3FileSystemConfiguration, maintaining the current behavior by default.

4.   Modify S3FileSystem's constructor to take an S3FileSystemConfiguration 
object instead of an S3Options, and make the relevant changes.

5.   Modify S3FileSystemRegistrar to load all the AutoService'd file system 
configurations, raising an exception if multiple scheme registrars attempt to 
register the same scheme.
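
As a rough illustration of steps 1 to 5 above, here is a minimal, self-contained Java sketch. It is not the Beam implementation: a plain Map<String, String> stands in for PipelineOptions, the endpoint field and the fromPipelineOptions and collect names are hypothetical, and the "vendor-a" scheme is made up. Only the duplicate-scheme check from step 5 is modeled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class S3SchemeRegistrarSketch {

  /** Step 1: per-scheme settings. The endpoint field is a hypothetical stand-in for S3Options values. */
  static final class S3FileSystemConfiguration {
    final String scheme;
    final String endpoint;

    S3FileSystemConfiguration(String scheme, String endpoint) {
      this.scheme = scheme;
      this.endpoint = endpoint;
    }
  }

  /** Step 2: registrar interface. Map<String, String> is an assumption standing in for PipelineOptions. */
  interface S3FileSystemSchemeRegistrar {
    Iterable<S3FileSystemConfiguration> fromPipelineOptions(Map<String, String> options);
  }

  /** Step 3: default registrar that keeps the plain "s3" scheme working. */
  static final class DefaultS3SchemeRegistrar implements S3FileSystemSchemeRegistrar {
    @Override
    public Iterable<S3FileSystemConfiguration> fromPipelineOptions(Map<String, String> options) {
      String endpoint = options.getOrDefault("endpoint", "");
      return Collections.singletonList(new S3FileSystemConfiguration("s3", endpoint));
    }
  }

  /** Step 5: gather every configuration and fail if two registrars claim the same scheme. */
  static Map<String, S3FileSystemConfiguration> collect(
      Iterable<S3FileSystemSchemeRegistrar> registrars, Map<String, String> options) {
    Map<String, S3FileSystemConfiguration> byScheme = new LinkedHashMap<>();
    for (S3FileSystemSchemeRegistrar registrar : registrars) {
      for (S3FileSystemConfiguration config : registrar.fromPipelineOptions(options)) {
        if (byScheme.putIfAbsent(config.scheme, config) != null) {
          throw new IllegalStateException("Scheme registered more than once: " + config.scheme);
        }
      }
    }
    return byScheme;
  }

  public static void main(String[] args) {
    // Hypothetical second registrar for an S3-compatible vendor.
    S3FileSystemSchemeRegistrar vendor =
        options -> Collections.singletonList(
            new S3FileSystemConfiguration("vendor-a", "https://storage.example.com"));
    Map<String, S3FileSystemConfiguration> configs =
        collect(Arrays.asList(new DefaultS3SchemeRegistrar(), vendor), Collections.emptyMap());
    System.out.println(configs.keySet()); // prints [s3, vendor-a]
  }
}
```

In this sketch, step 4 would correspond to constructing one file system object per entry in the returned map.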



I considered alternative methods of configuration, in particular by using some 
configuration file as in HadoopFileSystemOptions. In the end, I decided that 
the AutoService approach was better. First, it seems to me more common to do 
things this way within Beam. Second, unlike with Hadoop, there's no configuration 
format already in common use for these types of file systems, and it's not 
clear which format would be best (YAML? JSON? Java Properties? XML?). 
Finally, I think the story for composing multiple registrars is better than the 
story for composing multiple configuration files; for example, composing 
registrars may make sense if you are dealing with multiple storage vendors.
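
For readers unfamiliar with the mechanism: AutoService generates META-INF/services entries, which java.util.ServiceLoader then discovers at runtime, so registrars shipped in separate jars compose simply by being on the classpath. The sketch below shows only that discovery step; the SchemeRegistrar interface and discoverSchemes helper are hypothetical names, not Beam APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class RegistrarDiscoverySketch {

  /** Hypothetical service interface; each storage vendor's jar would ship its own implementation. */
  public interface SchemeRegistrar {
    String scheme();
  }

  /** Collects the scheme of every implementation that ServiceLoader finds on the classpath. */
  static List<String> discoverSchemes() {
    List<String> schemes = new ArrayList<>();
    for (SchemeRegistrar registrar : ServiceLoader.load(SchemeRegistrar.class)) {
      schemes.add(registrar.scheme());
    }
    return schemes;
  }

  public static void main(String[] args) {
    // With no provider entries on the classpath, the list is simply empty.
    System.out.println(discoverSchemes());
  }
}
```

Adding a vendor then means adding a jar rather than merging configuration files, which is the composition advantage argued for above.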



Matt



On 2021/05/19 13:27:16, Matt Rudary <m...@twosigma.com> wrote:

> Hi,
>
> This is a quick sketch of a proposal - I wanted to get a sense of whether
> there's general support for this idea before fleshing it out further, getting
> internal approvals, etc.
>
> I'm working with multiple storage systems that speak the S3 api. I would like
> to support FileIO operations for these storage systems, but S3FileSystem
> hardcodes the s3 scheme (the various systems use different URI schemes) and
> it is in any case impossible to instantiate more than one in the current
> design.
>
> I'd like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe
> ...aws.options) somewhat to enable this use-case. I haven't worked out the
> details yet, but it will take some thought to make this work in a non-hacky
> way.
>
> Thanks
> Matt Rudary


Re: Proposal: Generalize S3FileSystem

2021-05-24 Thread Matt Rudary
Thanks for the comments all. I forgot to subscribe to dev before I sent out the 
email, so this response isn't threaded properly.



My proposed design is to do the following (for both aws and aws2 packages):

1.   Add a public class, S3FileSystemConfiguration, that mostly maps to the 
S3Options, plus a Scheme field.

2.   Add a public interface, S3FileSystemSchemeRegistrar, designed for use 
with AutoService. It will have a method that takes a PipelineOptions and 
returns an Iterable of S3FileSystemConfiguration. This will be the way that 
users register their S3 URI schemes with the system.

3.   Add an implementation of S3FileSystemSchemeRegistrar for the s3 scheme 
that uses the S3Options from PipelineOptions to populate its 
S3FileSystemConfiguration, maintaining the current behavior by default.

4.   Modify S3FileSystem's constructor to take an S3FileSystemConfiguration 
object instead of an S3Options, and make the relevant changes.

5.   Modify S3FileSystemRegistrar to load all the AutoService'd file system 
configurations, raising an exception if multiple scheme registrars attempt to 
register the same scheme.



I considered alternative methods of configuration, in particular by using some 
configuration file as in HadoopFileSystemOptions. In the end, I decided that 
the AutoService approach was better. First, it seems to me more common to do 
things this way within Beam. Second, unlike with Hadoop, there's no configuration 
format already in common use for these types of file systems, and it's not 
clear which format would be best (YAML? JSON? Java Properties? XML?). 
Finally, I think the story for composing multiple registrars is better than the 
story for composing multiple configuration files; for example, composing 
registrars may make sense if you are dealing with multiple storage vendors.



Matt



On 2021/05/19 13:27:16, Matt Rudary <m...@twosigma.com> wrote:

> Hi,
>
> This is a quick sketch of a proposal - I wanted to get a sense of whether
> there's general support for this idea before fleshing it out further, getting
> internal approvals, etc.
>
> I'm working with multiple storage systems that speak the S3 api. I would like
> to support FileIO operations for these storage systems, but S3FileSystem
> hardcodes the s3 scheme (the various systems use different URI schemes) and
> it is in any case impossible to instantiate more than one in the current
> design.
>
> I'd like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe
> ...aws.options) somewhat to enable this use-case. I haven't worked out the
> details yet, but it will take some thought to make this work in a non-hacky
> way.
>
> Thanks
> Matt Rudary


Re: Proposal: Generalize S3FileSystem

2021-05-21 Thread Kenneth Knowles
Please follow URL intention if at all possible. Specifically the bits
before the : should indicate how to parse the rest of the URL, not other
information. Is this convention of sticking the host before the : already
an established thing for s3-compatible endpoints?

If the various S3-compatible providers have their own schemes, is it
possible to just register the same code with different config for those
schemes and not invent any new URLs? That would be ideal.
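
To make the point about the part before the colon concrete: RFC 3986 restricts a scheme to letters, digits, "+", "-" and ".", so the "s3+endpoint=api.example.com" form quoted below is not a syntactically valid scheme because of the "=". The following is a small sketch using java.net.URI; the schemeOrNull helper and the "vendor-a" scheme are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriSchemeSketch {

  /** Returns the URI scheme, or null if the string has no scheme or is not a valid URI. */
  static String schemeOrNull(String spec) {
    try {
      return new URI(spec).getScheme();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // A plain scheme parses as expected.
    System.out.println(schemeOrNull("s3://my/bucket/path"));
    // A hypothetical vendor scheme is also fine: '-' is a legal scheme character.
    System.out.println(schemeOrNull("vendor-a://my/bucket/path"));
    // '=' is not a legal scheme character, so parsing fails and the helper returns null.
    System.out.println(schemeOrNull("s3+endpoint=api.example.com://my/bucket/path"));
  }
}
```

Registering the same code under several plain schemes, each with its own configuration, avoids this parsing problem entirely.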

Kenn

On Thu, May 20, 2021 at 2:30 PM Charles Chen  wrote:

> Is it feasible to keep the endpoint information in the path?  It seems
> pretty desirable to keep URIs "universal" so that it's possible to
> understand what is being pointed to without explicit service configuration,
> so maybe you can have a scheme like
> "s3+endpoint=api.example.com://my/bucket/path"?
>
> On Thu, May 20, 2021 at 12:31 PM Kenneth Knowles  wrote:
>
>> $.02
>>
>> Most important is community to maintain it. It cannot be a separate
>> project or subproject (lots of ASF projects have this, so they share
>> governance) without that.
>>
>> To add additional friction of separate release and dependency in build
>> before you have community, it should be extremely stable so you upgrade
>> rarely. See the process of upgrading our vendored deps. It is considerable.
>>
>> Kenn
>>
>> On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer  wrote:
>>
>>> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova 
>>> wrote:
>>>
>>>> Hi Brian,
>>>> I think the main goal would be to make a python package that could be
>>>> pip installed independently of apache_beam.  That goal could be
>>>> accomplished with option 3, thus preserving all of the benefits of a
>>>> monorepo. If it gains enough popularity and contributors outside of the
>>>> Beam community, then options 1 and 2 could be considered to make it easier
>>>> to foster a new community of contributors.
>>>>
>>>
>>> This sounds like a lovely goal!
>>>
>>> I'll just mention the "fsspec" Python project, which came out of Dask:
>>> https://filesystem-spec.readthedocs.io/en/latest/
>>>
>>> As far as I can tell, it serves basically this exact same purpose
>>> (generic filesystems with high-performance IO), and has started to get some
>>> traction in other projects, e.g., it's now used in pandas. I don't know if
>>> it would be suitable for Beam, but it might be worth a try.
>>>
>>> Cheers,
>>> Stephan
>>>
>>>
>>>> Beam has a lot of great tech in it, and it makes me think of Celery,
>>>> which is a much older python project of a similar ilk that spawned a series
>>>> of useful independent projects: kombu [1], an AMQP messaging library, and
>>>> billiard [2], a multiprocessing library.
>>>>
>>>> Obviously, there are a number of pros and cons to consider.  The cons
>>>> are pretty clear: even within a monorepo it will make the Beam build more
>>>> complicated.  The pros are a bit more abstract.  The fileIO project could
>>>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>>>> etc), thereby increasing awareness of Beam amongst the types of
>>>> cloud-friendly python developers who would need the fileIO package.
>>>>
>>>> -chad
>>>>
>>>> [1] https://github.com/celery/kombu
>>>> [2] https://github.com/celery/billiard
>>>>
>>>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette
>>>> wrote:
>>>>
>>>>> That's an interesting idea. What do you mean by its own project? A
>>>>> couple of possibilities:
>>>>> - Spinning off a new ASF project
>>>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>>>> - More clearly separate it in the current build system and release
>>>>> artifacts that allow it to be used independently
>>>>>
>>>>> Personally I'd be resistant to the first two (I am a Google engineer
>>>>> and I like monorepos after all), but I don't see a major problem with the
>>>>> last one, except that it gives us another surface to maintain.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova
>>>>> wrote:
>>>>>
>>>>>> This is a random idea, but the whole file IO system inside Beam would
>>>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>>>> particularly tied to Beam.
>>>>>>
>>>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>>>> in mind for a future goal.
>>>>>>
>>>>>> -chad
>>>>>>
>>>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada
>>>>>> wrote:
>>>>>>
>>>>>>> That would be great to add, Matt. Of course it's important to make
>>>>>>> this backwards compatible, but other than that, the addition would be
>>>>>>> very welcome.
>>>>>>>
>>>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary <
>>>>>>> matt.rud...@twosigma.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>>>> further, getting internal approvals, etc.




Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Charles Chen
Is it feasible to keep the endpoint information in the path?  It seems
pretty desirable to keep URIs "universal" so that it's possible to
understand what is being pointed to without explicit service configuration,
so maybe you can have a scheme like
"s3+endpoint=api.example.com://my/bucket/path"?

On Thu, May 20, 2021 at 12:31 PM Kenneth Knowles  wrote:

> $.02
>
> Most important is community to maintain it. It cannot be a separate
> project or subproject (lots of ASF projects have this, so they share
> governance) without that.
>
> To add additional friction of separate release and dependency in build
> before you have community, it should be extremely stable so you upgrade
> rarely. See the process of upgrading our vendored deps. It is considerable.
>
> Kenn
>
> On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer  wrote:
>
>> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:
>>
>>> Hi Brian,
>>> I think the main goal would be to make a python package that could be
>>> pip installed independently of apache_beam.  That goal could be
>>> accomplished with option 3, thus preserving all of the benefits of a
>>> monorepo. If it gains enough popularity and contributors outside of the
>>> Beam community, then options 1 and 2 could be considered to make it easier
>>> to foster a new community of contributors.
>>>
>>
>> This sounds like a lovely goal!
>>
>> I'll just mention the "fsspec" Python project, which came out of Dask:
>> https://filesystem-spec.readthedocs.io/en/latest/
>>
>> As far as I can tell, it serves basically this exact same purpose
>> (generic filesystems with high-performance IO), and has started to get some
>> traction in other projects, e.g., it's now used in pandas. I don't know if
>> it would be suitable for Beam, but it might be worth a try.
>>
>> Cheers,
>> Stephan
>>
>>
>>> Beam has a lot of great tech in it, and it makes me think of Celery,
>>> which is a much older python project of a similar ilk that spawned a series
>>> of useful independent projects: kombu [1], an AMQP messaging library, and
>>> billiard [2], a multiprocessing library.
>>>
>>> Obviously, there are a number of pros and cons to consider.  The cons
>>> are pretty clear: even within a monorepo it will make the Beam build more
>>> complicated.  The pros are a bit more abstract.  The fileIO project could
>>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>>> etc), thereby increasing awareness of Beam amongst the types of
>>> cloud-friendly python developers who would need the fileIO package.
>>>
>>> -chad
>>>
>>> [1] https://github.com/celery/kombu
>>> [2] https://github.com/celery/billiard
>>>
>>>
>>>
>>>
>>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette 
>>> wrote:
>>>
>>>> That's an interesting idea. What do you mean by its own project? A
>>>> couple of possibilities:
>>>> - Spinning off a new ASF project
>>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>>> - More clearly separate it in the current build system and release
>>>> artifacts that allow it to be used independently
>>>>
>>>> Personally I'd be resistant to the first two (I am a Google engineer
>>>> and I like monorepos after all), but I don't see a major problem with the
>>>> last one, except that it gives us another surface to maintain.
>>>>
>>>> Brian
>>>>
>>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova
>>>> wrote:
>>>>
>>>>> This is a random idea, but the whole file IO system inside Beam would
>>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>>> particularly tied to Beam.
>>>>>
>>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>>> in mind for a future goal.
>>>>>
>>>>> -chad
>>>>>
>>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada
>>>>> wrote:
>>>>>
>>>>>> That would be great to add, Matt. Of course it's important to make
>>>>>> this backwards compatible, but other than that, the addition would be
>>>>>> very welcome.
>>>>>>
>>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>>> further, getting internal approvals, etc.
>>>>>>>
>>>>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>>>>> would like to support FileIO operations for these storage systems, but
>>>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>>>> one in the current design.
>>>>>>>
>>>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked
>>>>>>> out the details yet, but it will take some thought to make this work
>>>>>>> in a non-hacky way.
>>>>>>>

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Kenneth Knowles
$.02

Most important is community to maintain it. It cannot be a separate project
or subproject (lots of ASF projects have this, so they share governance)
without that.

To add additional friction of separate release and dependency in build
before you have community, it should be extremely stable so you upgrade
rarely. See the process of upgrading our vendored deps. It is considerable.

Kenn

On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer  wrote:

> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:
>
>> Hi Brian,
>> I think the main goal would be to make a python package that could be pip
>> installed independently of apache_beam.  That goal could be accomplished
>> with option 3, thus preserving all of the benefits of a monorepo. If it
>> gains enough popularity and contributors outside of the Beam community,
>> then options 1 and 2 could be considered to make it easier to foster a new
>> community of contributors.
>>
>
> This sounds like a lovely goal!
>
> I'll just mention the "fsspec" Python project, which came out of Dask:
> https://filesystem-spec.readthedocs.io/en/latest/
>
> As far as I can tell, it serves basically this exact same purpose (generic
> filesystems with high-performance IO), and has started to get some traction
> in other projects, e.g., it's now used in pandas. I don't know if it would
> be suitable for Beam, but it might be worth a try.
>
> Cheers,
> Stephan
>
>
>> Beam has a lot of great tech in it, and it makes me think of Celery,
>> which is a much older python project of a similar ilk that spawned a series
>> of useful independent projects: kombu [1], an AMQP messaging library, and
>> billiard [2], a multiprocessing library.
>>
>> Obviously, there are a number of pros and cons to consider.  The cons are
>> pretty clear: even within a monorepo it will make the Beam build more
>> complicated.  The pros are a bit more abstract.  The fileIO project could
>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>> etc), thereby increasing awareness of Beam amongst the types of
>> cloud-friendly python developers who would need the fileIO package.
>>
>> -chad
>>
>> [1] https://github.com/celery/kombu
>> [2] https://github.com/celery/billiard
>>
>>
>>
>>
>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette 
>> wrote:
>>
>>> That's an interesting idea. What do you mean by its own project? A
>>> couple of possibilities:
>>> - Spinning off a new ASF project
>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>> - More clearly separate it in the current build system and release
>>> artifacts that allow it to be used independently
>>>
>>> Personally I'd be resistant to the first two (I am a Google engineer and
>>> I like monorepos after all), but I don't see a major problem with the last
>>> one, except that it gives us another surface to maintain.
>>>
>>> Brian
>>>
>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>>>
>>>> This is a random idea, but the whole file IO system inside Beam would
>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>> particularly tied to Beam.
>>>>
>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>> in mind for a future goal.
>>>>
>>>> -chad
>>>>
>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada
>>>> wrote:
>>>>
>>>>> That would be great to add, Matt. Of course it's important to make
>>>>> this backwards compatible, but other than that, the addition would be very
>>>>> welcome.
>>>>>
>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>> further, getting internal approvals, etc.
>>>>>>
>>>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>>>> would like to support FileIO operations for these storage systems, but
>>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>>> one in the current design.
>>>>>>
>>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked
>>>>>> out the details yet, but it will take some thought to make this work
>>>>>> in a non-hacky way.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Matt Rudary
>>>>>>
>>>>>


Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Stephan Hoyer
On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:

> Hi Brian,
> I think the main goal would be to make a python package that could be pip
> installed independently of apache_beam.  That goal could be accomplished
> with option 3, thus preserving all of the benefits of a monorepo. If it
> gains enough popularity and contributors outside of the Beam community,
> then options 1 and 2 could be considered to make it easier to foster a new
> community of contributors.
>

This sounds like a lovely goal!

I'll just mention the "fsspec" Python project, which came out of Dask:
https://filesystem-spec.readthedocs.io/en/latest/

As far as I can tell, it serves basically this exact same purpose (generic
filesystems with high-performance IO), and has started to get some traction
in other projects, e.g., it's now used in pandas. I don't know if it would
be suitable for Beam, but it might be worth a try.

Cheers,
Stephan


> Beam has a lot of great tech in it, and it makes me think of Celery, which
> is a much older python project of a similar ilk that spawned a series of
> useful independent projects: kombu [1], an AMQP messaging library, and
> billiard [2], a multiprocessing library.
>
> Obviously, there are a number of pros and cons to consider.  The cons are
> pretty clear: even within a monorepo it will make the Beam build more
> complicated.  The pros are a bit more abstract.  The fileIO project could
> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
> etc), thereby increasing awareness of Beam amongst the types of
> cloud-friendly python developers who would need the fileIO package.
>
> -chad
>
> [1] https://github.com/celery/kombu
> [2] https://github.com/celery/billiard
>
>
>
>
> On Thu, May 20, 2021 at 7:57 AM Brian Hulette  wrote:
>
>> That's an interesting idea. What do you mean by its own project? A couple
>> of possibilities:
>> - Spinning off a new ASF project
>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>> - More clearly separate it in the current build system and release
>> artifacts that allow it to be used independently
>>
>> Personally I'd be resistant to the first two (I am a Google engineer and
>> I like monorepos after all), but I don't see a major problem with the last
>> one, except that it gives us another surface to maintain.
>>
>> Brian
>>
>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>>
>>> This is a random idea, but the whole file IO system inside Beam would
>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>> particularly tied to Beam.
>>>
>>> I’m not saying this should be done now, but it’d be nice to keep it in
>>> mind for a future goal.
>>>
>>> -chad
>>>
>>>
>>>
>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada 
>>> wrote:
>>>
>>>> That would be great to add, Matt. Of course it's important to make this
>>>> backwards compatible, but other than that, the addition would be very
>>>> welcome.
>>>>
>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>> whether there’s general support for this idea before fleshing it out
>>>>> further, getting internal approvals, etc.
>>>>>
>>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>>> would like to support FileIO operations for these storage systems, but
>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>> one in the current design.
>>>>>
>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>>> the details yet, but it will take some thought to make this work in a
>>>>> non-hacky way.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Matt Rudary
>>>>>



Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Chad Dombrova
Hi Brian,
I think the main goal would be to make a python package that could be pip
installed independently of apache_beam.  That goal could be accomplished
with option 3, thus preserving all of the benefits of a monorepo. If it
gains enough popularity and contributors outside of the Beam community,
then options 1 and 2 could be considered to make it easier to foster a new
community of contributors.

Beam has a lot of great tech in it, and it makes me think of Celery, which
is a much older python project of a similar ilk that spawned a series of
useful independent projects: kombu [1], an AMQP messaging library, and
billiard [2], a multiprocessing library.

Obviously, there are a number of pros and cons to consider.  The cons are
pretty clear: even within a monorepo it will make the Beam build more
complicated.  The pros are a bit more abstract.  The fileIO project could
appeal to a broader audience, and act as a signpost for Beam (on PyPI,
etc), thereby increasing awareness of Beam amongst the types of
cloud-friendly python developers who would need the fileIO package.

-chad

[1] https://github.com/celery/kombu
[2] https://github.com/celery/billiard




On Thu, May 20, 2021 at 7:57 AM Brian Hulette  wrote:

> That's an interesting idea. What do you mean by its own project? A couple
> of possibilities:
> - Spinning off a new ASF project
> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
> - More clearly separate it in the current build system and release
> artifacts that allow it to be used independently
>
> Personally I'd be resistant to the first two (I am a Google engineer and I
> like monorepos after all), but I don't see a major problem with the last
> one, except that it gives us another surface to maintain.
>
> Brian
>
> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>
>> This is a random idea, but the whole file IO system inside Beam would
>> actually be awesome to extract into its own project.  IIRC, it’s not
>> particularly tied to Beam.
>>
>> I’m not saying this should be done now, but it’d be nice to keep it in
>> mind for a future goal.
>>
>> -chad
>>
>>
>>
>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada 
>> wrote:
>>
>>> That would be great to add, Matt. Of course it's important to make this
>>> backwards compatible, but other than that, the addition would be very
>>> welcome.
>>>
>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>> whether there’s general support for this idea before fleshing it out
>>>> further, getting internal approvals, etc.
>>>>
>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>> would like to support FileIO operations for these storage systems, but
>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>>> schemes) and it is in any case impossible to instantiate more than one in
>>>> the current design.
>>>>
>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>> the details yet, but it will take some thought to make this work in a
>>>> non-hacky way.
>>>>
>>>> Thanks
>>>>
>>>> Matt Rudary
>>>>
>>>


Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Brian Hulette
That's an interesting idea. What do you mean by its own project? A couple
of possibilities:
- Spinning off a new ASF project
- A separate Beam-governed repository (e.g. apache/beam-filesystems)
- More clearly separate it in the current build system and release
artifacts that allow it to be used independently

Personally I'd be resistant to the first two (I am a Google engineer and I
like monorepos after all), but I don't see a major problem with the last
one, except that it gives us another surface to maintain.

Brian

On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:

> This is a random idea, but the whole file IO system inside Beam would
> actually be awesome to extract into its own project.  IIRC, it’s not
> particularly tied to Beam.
>
> I’m not saying this should be done now, but it’d be nice to keep it in
> mind for a future goal.
>
> -chad
>
>
>
> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada  wrote:
>
>> That would be great to add, Matt. Of course it's important to make this
>> backwards compatible, but other than that, the addition would be very
>> welcome.
>>
>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>> whether there’s general support for this idea before fleshing it out
>>> further, getting internal approvals, etc.
>>>
>>>
>>>
>>> I’m working with multiple storage systems that speak the S3 api. I would
>>> like to support FileIO operations for these storage systems, but
>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>> schemes) and it is in any case impossible to instantiate more than one in
>>> the current design.
>>>
>>>
>>>
>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>> the details yet, but it will take some thought to make this work in a
>>> non-hacky way.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Matt Rudary
>>>
>>


Re: Proposal: Generalize S3FileSystem

2021-05-19 Thread Chad Dombrova
This is a random idea, but the whole file IO system inside Beam would
actually be awesome to extract into its own project.  IIRC, it’s not
particularly tied to Beam.

I’m not saying this should be done now, but it’d be nice to keep it in mind
for a future goal.

-chad



On Wed, May 19, 2021 at 10:23 AM Pablo Estrada  wrote:

> That would be great to add, Matt. Of course it's important to make this
> backwards compatible, but other than that, the addition would be very
> welcome.
>
> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
> wrote:
>
>> Hi,
>>
>>
>>
>> This is a quick sketch of a proposal – I wanted to get a sense of whether
>> there’s general support for this idea before fleshing it out further,
>> getting internal approvals, etc.
>>
>>
>>
>> I’m working with multiple storage systems that speak the S3 api. I would
>> like to support FileIO operations for these storage systems, but
>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>> schemes) and it is in any case impossible to instantiate more than one in
>> the current design.
>>
>>
>>
>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe
>> …aws.options) somewhat to enable this use-case. I haven’t worked out the
>> details yet, but it will take some thought to make this work in a non-hacky
>> way.
>>
>>
>>
>> Thanks
>>
>> Matt Rudary
>>
>


Re: Proposal: Generalize S3FileSystem

2021-05-19 Thread Pablo Estrada
That would be great to add, Matt. Of course it's important to make this
backwards compatible, but other than that, the addition would be very
welcome.

On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
wrote:

> Hi,
>
>
>
> This is a quick sketch of a proposal – I wanted to get a sense of whether
> there’s general support for this idea before fleshing it out further,
> getting internal approvals, etc.
>
>
>
> I’m working with multiple storage systems that speak the S3 api. I would
> like to support FileIO operations for these storage systems, but
> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
> schemes) and it is in any case impossible to instantiate more than one in
> the current design.
>
>
>
> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe
> …aws.options) somewhat to enable this use-case. I haven’t worked out the
> details yet, but it will take some thought to make this work in a non-hacky
> way.
>
>
>
> Thanks
>
> Matt Rudary
>


Proposal: Generalize S3FileSystem

2021-05-19 Thread Matt Rudary
Hi,

This is a quick sketch of a proposal - I wanted to get a sense of whether 
there's general support for this idea before fleshing it out further, getting 
internal approvals, etc.

I'm working with multiple storage systems that speak the S3 api. I would like 
to support FileIO operations for these storage systems, but S3FileSystem 
hardcodes the s3 scheme (the various systems use different URI schemes) and it 
is in any case impossible to instantiate more than one in the current design.

I'd like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe 
...aws.options) somewhat to enable this use-case. I haven't worked out the 
details yet, but it will take some thought to make this work in a non-hacky way.

Thanks
Matt Rudary