As is, how does VFS improve upon the current FileSystem solution?

How much work is needed before VFS supports the Apache Beam use cases (bulk
operations, glob support)?

Is it the right direction for the VFS project to support the above changes?
(things that are important to a data parallel processing system aren't
always important for a filesystem implementation.)

For example, Apache Beam relies on bulk match, bulk delete, and bulk rename
to do things within FileIO efficiently (Dataflow has had a bunch of
experience where renaming one file at a time, even when using multiple
threads, is quite slow when you have a million files to rename, so having
bulk APIs is important). The current FileSystems also has a registration
mechanism and ties into PipelineOptions pretty well. In my opinion, the
largest deficiency I see with FileSystems is that we should have used URIs[1]
instead of abstract resource types, since we could then standardize how URIs
are resolved and how globs work for them, allowing FileSystem authors to
implement even less. The tricky part is how a URI maps onto the file system
correctly.
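
To make the bulk-API and URI points concrete, here is a minimal sketch. It
is not Beam's FileSystem nor the VFS API: BulkUriFileSystem,
InMemoryFileSystem, BulkUriSketch, the "mem" scheme, and the call counter
are all hypothetical names made up for illustration. It assumes glob
semantics borrowed from java.nio applied to the URI path only, and that the
glob uses only characters that are legal in a URI (such as "*").

```java
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical URI-based bulk interface. Not Beam's FileSystem or the VFS API.
interface BulkUriFileSystem {
  List<URI> match(URI globSpec);                      // one call, many results
  void rename(List<URI> sources, List<URI> targets);  // one call, many files
  void delete(List<URI> uris);                        // one call, many files
}

// Toy in-memory implementation, only here to make the sketch runnable.
class InMemoryFileSystem implements BulkUriFileSystem {
  private final Set<URI> files = new TreeSet<>();
  int calls = 0; // number of interface calls, a stand-in for backend round trips

  void create(URI uri) {
    files.add(uri);
  }

  @Override
  public List<URI> match(URI globSpec) {
    calls++;
    // Assumption: java.nio glob rules, applied to the path component only.
    PathMatcher matcher =
        FileSystems.getDefault().getPathMatcher("glob:" + globSpec.getPath());
    List<URI> matched = new ArrayList<>();
    for (URI file : files) {
      boolean sameRoot =
          Objects.equals(file.getScheme(), globSpec.getScheme())
              && Objects.equals(file.getAuthority(), globSpec.getAuthority());
      if (sameRoot && matcher.matches(Paths.get(file.getPath()))) {
        matched.add(file);
      }
    }
    return matched;
  }

  @Override
  public void rename(List<URI> sources, List<URI> targets) {
    if (sources.size() != targets.size()) {
      throw new IllegalArgumentException("sources and targets must pair up");
    }
    calls++;
    for (int i = 0; i < sources.size(); i++) {
      if (!files.remove(sources.get(i))) {
        throw new IllegalStateException("missing source: " + sources.get(i));
      }
      files.add(targets.get(i));
    }
  }

  @Override
  public void delete(List<URI> uris) {
    calls++;
    files.removeAll(uris);
  }
}

class BulkUriSketch {
  // Moves every file matching tmpDir + glob into outDir with one rename call.
  static List<URI> finalizeOutputs(
      BulkUriFileSystem fs, URI tmpDir, URI outDir, String glob) {
    List<URI> sources = fs.match(tmpDir.resolve(glob));
    List<URI> targets = new ArrayList<>();
    for (URI source : sources) {
      String name = Paths.get(source.getPath()).getFileName().toString();
      targets.add(outDir.resolve(name));
    }
    fs.rename(sources, targets);
    return targets;
  }

  public static void main(String[] args) {
    InMemoryFileSystem fs = new InMemoryFileSystem();
    URI tmp = URI.create("mem://bucket/tmp/");
    URI out = URI.create("mem://bucket/out/");
    for (int i = 0; i < 3; i++) {
      fs.create(tmp.resolve("part-" + i + ".txt"));
    }
    List<URI> moved = finalizeOutputs(fs, tmp, out, "part-*.txt");
    System.out.println(moved.size() + " files moved in " + fs.calls + " calls");
  }
}
```

The caller issues one match and one rename no matter how many files are
involved, which leaves the implementation free to batch them. Resolution
goes through java.net.URI.resolve, so an implementer of this hypothetical
interface would not need to write its own path-joining logic.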

1: https://issues.apache.org/jira/browse/BEAM-2283

On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> On Mar 5, 2018 at 22:26, "Robert Bradshaw" <rober...@google.com> wrote:
>
> First, let's try to make the terminology abundantly clear, as I for one
> have (I think) misinterpreted what has been proposed.
>
> VfsFileSystem: A subclass of
> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>
> VfsIO: A replacement for
> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
> written using Vfs instead of
> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>
>
>
> Ack
>
>
> Between these two options, VfsFileSystem is the way to go. It will allow
> us to use all our existing File sources/sinks (including all the fancy
> watching/streaming support from FileIO) with any filesystem supported by
> VFS. Long-term, if VFS is good enough (and we'll be able to do direct
> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop), we could
> consider moving to VFS entirely and even removing the layer of indirection.
> VFS is a filesystem; this is the right level of abstraction to plug into.
> Even if it's lacking in some respects, it may still be worth keeping in
> parallel to the existing FileSystem implementations long-term if it has
> significantly better coverage.
>
>
> Ok
>
>
> On the other hand, a re-implementation of FileIO on top of Vfs seems like
> a lot of duplication of code (and ongoing maintenance cost) and will be
> difficult to build on top of (e.g. the binding of TextIO to FileIO is not
> dynamic like the binding of filesystems).
>
>
> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
> Beam, can make both projects grow from that work and become more mature and
> interoperable with the existing ecosystem (who implements a Beam filesystem
> when providing a new filesystem?). Interestingly, recent Java versions
> have a filesystem abstraction too, but that one is harder to evolve
> for our needs. The high-level goal is to keep it ecosystem-friendly and not
> create yet another one.
>
>
>
>
> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Not backing VFS by a filesystem sounds saner, so VfsIO is probably the way
>> to go. It would be a competitor to FileIO and hopefully its replacement in
>> the mid/long term.
>>
>> What about doing the opposite: implementing a VFS filesystem for all the
>> filesystems we support, potentially enriching VFS if needed? Then, from
>> what I read, we can just drop the Beam abstraction.
>>
>> On Mar 5, 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
>>
>>> The terminology is confusing here, since the existing FileIO is a
>>> PTransform. VfsFilesystem would be a better name.
>>>
>>>
>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>
>>>>
>>>> Yes, I think this is the VfsIO that was proposed.
>>>>
>>>>
>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <chamik...@google.com>:
>>>>>>>
>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>>>>>>
>>>>>>>> I'm not sure if we considered this when we originally implemented
>>>>>>>> our own file-system abstraction, but based on a quick look it seems
>>>>>>>> like this is Java only.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, Java only.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>> languages is a plus point for Beam. Maybe we should consider a Java
>>>>>>>> file-system implementation for VFS?
>>>>>>>>
>>>>>>>
>>>>>>> That can be an option, but when I see the current complexity I'm not
>>>>>>> sure mixing two abstractions would help. Maybe just a VfsIO for Java
>>>>>>> users would be good enough - thinking out loud.
>>>>>>>
>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>> abstraction - which more or less matches your proposal. However, we
>>>>>>> can still make it smooth and easy on the Java side - which will likely
>>>>>>> stay mainstream for some years yet - by using VFS as our Java impl
>>>>>>> instead of reimplementing the full abstraction? This way we keep
>>>>>>> our *API* but we drop the Beam *impl* and just reuse VFS.
>>>>>>>
>>>>>>> PS: for GCS, https://github.com/ltouati/vfs-gcs can be a good
>>>>>>> example of how it can work.
>>>>>>>
>>>>>>
>>>>>> I think a VfsIO makes a lot of sense in the short term, and will give
>>>>>> us the experience needed to decide if we can move solely to VFS (for
>>>>>> Java at least) for implementation, and possibly API in a future major
>>>>>> release, in the long run.
>>>>>>
>>>>>
>
