Re: Proposal: Generalize S3FileSystem

Kenneth Knowles Fri, 21 May 2021 16:08:59 -0700

Please follow URL intention if at all possible. Specifically the bits
before the : should indicate how to parse the rest of the URL, not other
information. Is this convention of sticking the host before the : already
an established thing for s3-compatible endpoints?
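
A minimal sketch of the parsing concern, assuming Python's standard urllib.parse as the parser (an illustrative choice, not something the thread prescribes). RFC 3986 limits scheme characters to letters, digits, "+", "-" and ".", so a scheme carrying "endpoint=..." is not recognized as a scheme at all:

```python
from urllib.parse import urlsplit

# Conventional form: the scheme only says how to parse the rest.
plain = urlsplit("s3://my-bucket/path/to/object")

# Form suggested in the quoted message below. "=" is not a legal scheme
# character under RFC 3986, so urlsplit finds no scheme and treats the
# whole string as a path.
embedded = urlsplit("s3+endpoint=api.example.com://my/bucket/path")

print(repr(plain.scheme), repr(plain.netloc), repr(plain.path))
print(repr(embedded.scheme), repr(embedded.netloc))
```

Other URL parsers may behave differently; this only shows what one standards-following parser does with the two forms.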


If the various S3-compatible providers have their own schemes, is it
possible to just register the same code with different config for those
schemes and not invent any new URLs? That would be ideal.
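
A rough sketch of that idea, under stated assumptions: every name here (S3CompatibleConfig, S3CompatibleFileSystem, REGISTRY, the "examplestore" scheme and both endpoints) is hypothetical and is not Beam's API. It only shows one implementation being registered several times, once per scheme, with the endpoint held in config rather than in the URL:

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class S3CompatibleConfig:
    """Hypothetical per-scheme settings; not a Beam class."""
    scheme: str
    endpoint: str  # placeholder endpoint, supplied by configuration


class S3CompatibleFileSystem:
    """Hypothetical stand-in for one shared S3-protocol implementation."""

    def __init__(self, config):
        self.config = config

    def resolve(self, uri):
        """Return (endpoint, bucket, key) for a URI in this scheme."""
        parts = urlsplit(uri)
        if parts.scheme != self.config.scheme:
            raise ValueError(f"expected scheme {self.config.scheme!r}")
        return (self.config.endpoint, parts.netloc, parts.path.lstrip("/"))


# scheme -> filesystem instance: the same class, different config.
REGISTRY = {}


def register(config):
    REGISTRY[config.scheme] = S3CompatibleFileSystem(config)


def filesystem_for(uri):
    scheme = urlsplit(uri).scheme
    if scheme not in REGISTRY:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")
    return REGISTRY[scheme]


# Two registrations of the same code; endpoints are made-up examples.
register(S3CompatibleConfig("s3", "https://s3.example.org"))
register(S3CompatibleConfig("examplestore", "https://api.example.com"))

uri = "examplestore://bucket/a/b.txt"
resolved = filesystem_for(uri).resolve(uri)
print(resolved)
```

The URLs stay in whatever form each provider already uses; only the scheme-to-config mapping is new.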

Kenn

On Thu, May 20, 2021 at 2:30 PM Charles Chen <[email protected]> wrote:

> Is it feasible to keep the endpoint information in the path?  It seems
> pretty desirable to keep URIs "universal" so that it's possible to
> understand what is being pointed to without explicit service configuration,
> so maybe you can have a scheme like
> "s3+endpoint=api.example.com://my/bucket/path"?
>
> On Thu, May 20, 2021 at 12:31 PM Kenneth Knowles <[email protected]> wrote:
>
>> $.02
>>
>> Most important is a community to maintain it. It cannot be a separate
>> project or subproject (lots of ASF projects have this, so they share
>> governance) without that.
>>
>> If you add the friction of a separate release and a build dependency
>> before you have a community, it should be extremely stable so you upgrade
>> rarely. See the process of upgrading our vendored deps. It is considerable.
>>
>> Kenn
>>
>> On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova <[email protected]>
>>> wrote:
>>>
>>>> Hi Brian,
>>>> I think the main goal would be to make a python package that could be
>>>> pip installed independently of apache_beam.  That goal could be
>>>> accomplished with option 3, thus preserving all of the benefits of a
>>>> monorepo. If it gains enough popularity and contributors outside of the
>>>> Beam community, then options 1 and 2 could be considered to make it easier
>>>> to foster a new community of contributors.
>>>>
>>>
>>> This sounds like a lovely goal!
>>>
>>> I'll just mention the "fsspec" Python project, which came out of Dask:
>>> https://filesystem-spec.readthedocs.io/en/latest/
>>>
>>> As far as I can tell, it serves basically this exact same purpose
>>> (generic filesystems with high-performance IO), and has started to get some
>>> traction in other projects, e.g., it's now used in pandas. I don't know if
>>> it would be suitable for Beam, but it might be worth a try.
>>>
>>> Cheers,
>>> Stephan
>>>
>>>
>>>> Beam has a lot of great tech in it, and it makes me think of Celery,
>>>> which is a much older python project of a similar ilk that spawned a series
>>>> of useful independent projects: kombu [1], an AMQP messaging library, and
>>>> billiard [2], a multiprocessing library.
>>>>
>>>> Obviously, there are a number of pros and cons to consider.  The cons
>>>> are pretty clear: even within a monorepo it will make the Beam build more
>>>> complicated.  The pros are a bit more abstract.  The fileIO project could
>>>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>>>> etc), thereby increasing awareness of Beam amongst the types of
>>>> cloud-friendly python developers who would need the fileIO package.
>>>>
>>>> -chad
>>>>
>>>> [1] https://github.com/celery/kombu
>>>> [2] https://github.com/celery/billiard
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> That's an interesting idea. What do you mean by its own project? A
>>>>> couple of possibilities:
>>>>> - Spinning off a new ASF project
>>>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>>>> - More clearly separate it in the current build system and release
>>>>> artifacts that allow it to be used independently
>>>>>
>>>>> Personally I'd be resistant to the first two (I am a Google engineer
>>>>> and I like monorepos after all), but I don't see a major problem with the
>>>>> last one, except that it gives us another surface to maintain.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> This is a random idea, but the whole file IO system inside Beam would
>>>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>>>> particularly tied to Beam.
>>>>>>
>>>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>>>> in mind as a future goal.
>>>>>>
>>>>>> -chad
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> That would be great to add, Matt. Of course it's important to make
>>>>>>> this backwards compatible, but other than that, the addition would be
>>>>>>> very welcome.
>>>>>>>
>>>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>>>> further, getting internal approvals, etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>>>>>> would like to support FileIO operations for these storage systems, but
>>>>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use
>>>>>>>> different URI schemes) and it is in any case impossible to
>>>>>>>> instantiate more than one in the current design.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t
>>>>>>>> worked out the details yet, but it will take some thought to make
>>>>>>>> this work in a
>>>>>>>> non-hacky way.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Matt Rudary
>>>>>>>>
>>>>>>>
