On Thu, May 20, 2021 at 10:12 AM Chad Dombrova <chad...@gmail.com> wrote:

> Hi Brian,
> I think the main goal would be to make a python package that could be pip
> installed independently of apache_beam.  That goal could be accomplished
> with option 3, thus preserving all of the benefits of a monorepo. If it
> gains enough popularity and contributors outside of the Beam community,
> then options 1 and 2 could be considered to make it easier to foster a new
> community of contributors.
>

This sounds like a lovely goal!

I'll just mention the "fsspec" Python project, which came out of Dask:
https://filesystem-spec.readthedocs.io/en/latest/

As far as I can tell, it serves essentially the same purpose (generic
filesystems with high-performance IO) and has started to get traction in
other projects; for example, pandas now uses it. I don't know whether it
would be suitable for Beam, but it might be worth a try.
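
To make "generic filesystems" a bit more concrete, here is a minimal sketch
of the underlying idea: a registry that picks a backend from the URI scheme.
It uses only the Python standard library. Every name in it (LocalFileSystem,
Registry, the "storea"/"storeb" schemes) is hypothetical and made up for
illustration; this is neither fsspec's API nor Beam's. It also shows two
instances of the same backend class registered under different schemes, which
is roughly the situation Matt describes for S3-compatible stores.

```python
import os
import tempfile
from urllib.parse import urlparse


class LocalFileSystem:
    """Hypothetical backend that stores objects under a root directory."""

    def __init__(self, root):
        self.root = root

    def _full_path(self, path):
        # Strip any leading slash so the path stays inside the root.
        return os.path.join(self.root, path.lstrip("/"))

    def write(self, path, data):
        full = self._full_path(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def read(self, path):
        with open(self._full_path(path), "rb") as f:
            return f.read()


class Registry:
    """Hypothetical registry mapping a URI scheme to a filesystem instance."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, scheme, filesystem):
        self._by_scheme[scheme] = filesystem

    def _resolve(self, url):
        parsed = urlparse(url)
        # Raises KeyError when no filesystem is registered for the scheme.
        filesystem = self._by_scheme[parsed.scheme]
        return filesystem, parsed.netloc + parsed.path

    def write(self, url, data):
        filesystem, path = self._resolve(url)
        filesystem.write(path, data)

    def read(self, url):
        filesystem, path = self._resolve(url)
        return filesystem.read(path)


# Two independent instances of the same class, one per scheme.
registry = Registry()
registry.register("storea", LocalFileSystem(tempfile.mkdtemp()))
registry.register("storeb", LocalFileSystem(tempfile.mkdtemp()))

registry.write("storea://bucket/key.txt", b"from a")
registry.write("storeb://bucket/key.txt", b"from b")

print(registry.read("storea://bucket/key.txt"))
print(registry.read("storeb://bucket/key.txt"))
```

The point of keying on the scheme rather than on the class is that the same
implementation can back several endpoints at once, each with its own
configuration.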

Cheers,
Stephan


> Beam has a lot of great tech in it, and it makes me think of Celery, which
> is a much older python project of a similar ilk that spawned a series of
> useful independent projects: kombu [1], an AMQP messaging library, and
> billiard [2], a multiprocessing library.
>
> Obviously, there are a number of pros and cons to consider.  The cons are
> pretty clear: even within a monorepo it will make the Beam build more
> complicated.  The pros are a bit more abstract.  The fileIO project could
> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
> etc), thereby increasing awareness of Beam amongst the types of
> cloud-friendly python developers who would need the fileIO package.
>
> -chad
>
> [1] https://github.com/celery/kombu
> [2] https://github.com/celery/billiard
>
>
>
>
> On Thu, May 20, 2021 at 7:57 AM Brian Hulette <bhule...@google.com> wrote:
>
>> That's an interesting idea. What do you mean by its own project? A couple
>> of possibilities:
>> - Spinning off a new ASF project
>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>> - More clearly separate it in the current build system and release
>> artifacts that allow it to be used independently
>>
>> Personally I'd be resistant to the first two (I am a Google engineer and
>> I like monorepos after all), but I don't see a major problem with the last
>> one, except that it gives us another surface to maintain.
>>
>> Brian
>>
>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova <chad...@gmail.com> wrote:
>>
>>> This is a random idea, but the whole file IO system inside Beam would
>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>> particularly tied to Beam.
>>>
>>> I’m not saying this should be done now, but it’d be nice to keep it in
>>> mind as a future goal.
>>>
>>> -chad
>>>
>>>
>>>
>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada <pabl...@google.com>
>>> wrote:
>>>
>>>> That would be great to add, Matt. Of course it's important to make this
>>>> backwards compatible, but other than that, the addition would be very
>>>> welcome.
>>>>
>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary <matt.rud...@twosigma.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>> whether there’s general support for this idea before fleshing it out
>>>>> further, getting internal approvals, etc.
>>>>>
>>>>>
>>>>>
>>>>> I’m working with multiple storage systems that speak the S3 API. I
>>>>> would like to support FileIO operations for these storage systems, but
>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>> one in the current design.
>>>>>
>>>>>
>>>>>
>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>>> the details yet, but it will take some thought to make this work in a
>>>>> non-hacky way.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Matt Rudary
>>>>>
>>>>
