Hi Brian,
I think the main goal would be to make a Python package that could be
pip-installed independently of apache_beam.  That goal could be accomplished
with option 3 (separating it more clearly in the current build system and
release artifacts), thus preserving all of the benefits of a monorepo. If it
gains enough popularity and contributors outside of the Beam community,
then options 1 and 2 could be considered to make it easier to foster a new
community of contributors.

Beam has a lot of great tech in it, and it makes me think of Celery, which
is a much older Python project of a similar ilk that spawned several
useful independent projects, including kombu [1], an AMQP messaging library,
and billiard [2], a multiprocessing library.

Obviously, there are a number of pros and cons to consider.  The cons are
pretty clear: even within a monorepo it will make the Beam build more
complicated.  The pros are a bit more abstract.  The fileIO project could
appeal to a broader audience and act as a signpost for Beam (on PyPI,
etc.), thereby increasing awareness of Beam among the kinds of
cloud-friendly Python developers who would need the fileIO package.

-chad

[1] https://github.com/celery/kombu
[2] https://github.com/celery/billiard




On Thu, May 20, 2021 at 7:57 AM Brian Hulette <bhule...@google.com> wrote:

> That's an interesting idea. What do you mean by its own project? A couple
> of possibilities:
> - Spinning off a new ASF project
> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
> - More clearly separate it in the current build system and release
> artifacts that allow it to be used independently
>
> Personally I'd be resistant to the first two (I am a Google engineer and I
> like monorepos after all), but I don't see a major problem with the last
> one, except that it gives us another surface to maintain.
>
> Brian
>
> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova <chad...@gmail.com> wrote:
>
>> This is a random idea, but the whole file IO system inside Beam would
>> actually be awesome to extract into its own project.  IIRC, it’s not
>> particularly tied to Beam.
>>
>> I’m not saying this should be done now, but it’d be nice to keep it in mind
>> as a future goal.
>>
>> -chad
>>
>>
>>
>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada <pabl...@google.com>
>> wrote:
>>
>>> That would be great to add, Matt. Of course it's important to make this
>>> backwards compatible, but other than that, the addition would be very
>>> welcome.
>>>
>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary <matt.rud...@twosigma.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>> whether there’s general support for this idea before fleshing it out
>>>> further, getting internal approvals, etc.
>>>>
>>>>
>>>>
>>>> I’m working with multiple storage systems that speak the S3 API. I
>>>> would like to support FileIO operations for these storage systems, but
>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>>> schemes) and it is in any case impossible to instantiate more than one in
>>>> the current design.
>>>>
>>>>
>>>>
>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>> the details yet, but it will take some thought to make this work in a
>>>> non-hacky way.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Matt Rudary
>>>>
>>>
