On Thu, May 20, 2021 at 10:12 AM Chad Dombrova <chad...@gmail.com> wrote:
> Hi Brian,
> I think the main goal would be to make a Python package that could be pip
> installed independently of apache_beam. That goal could be accomplished
> with option 3, thus preserving all of the benefits of a monorepo. If it
> gains enough popularity and contributors outside of the Beam community,
> then options 1 and 2 could be considered to make it easier to foster a new
> community of contributors.
>

This sounds like a lovely goal!

I'll just mention the "fsspec" Python project, which came out of Dask:
https://filesystem-spec.readthedocs.io/en/latest/

As far as I can tell, it serves basically this exact same purpose (generic
filesystems with high-performance IO), and has started to get some traction
in other projects, e.g., it's now used in pandas. I don't know if it would
be suitable for Beam, but it might be worth a try.

Cheers,
Stephan

> Beam has a lot of great tech in it, and it makes me think of Celery, which
> is a much older Python project of a similar ilk that spawned a series of
> useful independent projects: kombu [1], an AMQP messaging library, and
> billiard [2], a multiprocessing library.
>
> Obviously, there are a number of pros and cons to consider. The cons are
> pretty clear: even within a monorepo it will make the Beam build more
> complicated. The pros are a bit more abstract. The fileIO project could
> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
> etc), thereby increasing awareness of Beam amongst the types of
> cloud-friendly Python developers who would need the fileIO package.
>
> -chad
>
> [1] https://github.com/celery/kombu
> [2] https://github.com/celery/billiard
>
> On Thu, May 20, 2021 at 7:57 AM Brian Hulette <bhule...@google.com> wrote:
>
>> That's an interesting idea. What do you mean by its own project? A couple
>> of possibilities:
>> - Spinning off a new ASF project
>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>> - More clearly separate it in the current build system and release
>> artifacts that allow it to be used independently
>>
>> Personally I'd be resistant to the first two (I am a Google engineer and
>> I like monorepos after all), but I don't see a major problem with the last
>> one, except that it gives us another surface to maintain.
>>
>> Brian
>>
>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova <chad...@gmail.com> wrote:
>>
>>> This is a random idea, but the whole file IO system inside Beam would
>>> actually be awesome to extract into its own project. IIRC, it’s not
>>> particularly tied to Beam.
>>>
>>> I’m not saying this should be done now, but it’d be nice to keep it in
>>> mind for a future goal.
>>>
>>> -chad
>>>
>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada <pabl...@google.com>
>>> wrote:
>>>
>>>> That would be great to add, Matt. Of course it's important to make this
>>>> backwards compatible, but other than that, the addition would be very
>>>> welcome.
>>>>
>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary <matt.rud...@twosigma.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>> whether there’s general support for this idea before fleshing it out
>>>>> further, getting internal approvals, etc.
>>>>>
>>>>> I’m working with multiple storage systems that speak the S3 API. I
>>>>> would like to support FileIO operations for these storage systems, but
>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>> one in the current design.
>>>>>
>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>>> the details yet, but it will take some thought to make this work in a
>>>>> non-hacky way.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Matt Rudary
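
For anyone skimming to the bottom of the thread: the limitation Matt
describes (one hardcoded scheme, at most one instance) can be pictured with
a small, purely illustrative sketch. This is not Beam's code or API. The
class names, the "minio" scheme and the endpoint URLs are all made up, and
it is written in Python rather than Java only to keep it short. It assumes
each S3-compatible backend gets its own instance, keyed by URI scheme.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class S3CompatibleFileSystem:
    """Hypothetical stand-in for one S3-compatible backend (not a Beam class)."""

    scheme: str    # URI scheme this instance answers to, e.g. "s3"
    endpoint: str  # service endpoint for this backend (made-up values below)

    def resolve(self, uri: str) -> str:
        # Map "<scheme>://<bucket>/<key>" onto this backend's endpoint.
        parts = urlsplit(uri)
        return f"{self.endpoint}/{parts.netloc}{parts.path}"


class FileSystemRegistry:
    """Holds one filesystem per scheme instead of a single hardcoded "s3"."""

    def __init__(self) -> None:
        self._by_scheme = {}

    def register(self, fs: S3CompatibleFileSystem) -> None:
        if fs.scheme in self._by_scheme:
            raise ValueError(f"scheme {fs.scheme!r} is already registered")
        self._by_scheme[fs.scheme] = fs

    def for_uri(self, uri: str) -> S3CompatibleFileSystem:
        scheme = urlsplit(uri).scheme
        try:
            return self._by_scheme[scheme]
        except KeyError:
            raise LookupError(f"no filesystem registered for {scheme!r}") from None


registry = FileSystemRegistry()
registry.register(S3CompatibleFileSystem("s3", "https://s3.amazonaws.com"))
registry.register(S3CompatibleFileSystem("minio", "https://minio.example.internal"))

# Each URI is routed to the instance that owns its scheme.
print(registry.for_uri("minio://bucket/key.txt").resolve("minio://bucket/key.txt"))
```

Keying on the scheme is only one possible design. It keeps existing s3://
paths working unchanged, which seems in line with Pablo's point about
backwards compatibility, but the actual refactoring may look quite
different.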