Created https://issues.apache.org/jira/browse/BEAM-3786 to track the
discussion (without putting too much detail in the ticket for now).


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-06 11:27 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com>:

> @Reuven: this is what I had in mind, yes.
>
>
> 2018-03-06 11:24 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> Part of the point of the current Filesystem class _is_ to handle these
>> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
>> answer is to keep Filesystem but put Vfs under it (and maybe that will
>> eventually allow us to remove some of the current code).
>>
>>
>> On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On 6 March 2018 at 01:05, "Lukasz Cwik" <lc...@google.com> wrote:
>>>
>>> As is, how does VFS improve upon the current FileSystem solution?
>>>
>>>
>>> This is about the ecosystem. I have never seen a Beam filesystem
>>> implemented outside Beam, but I have seen tons of VFS users and
>>> implementations.
>>>
>>>
>>>
>>> How much work is it before VFS supports the Apache Beam use cases (bulk
>>> operations, glob support)?
>>>
>>>
>>> I don't expect VFS to handle globs, but I expect Beam to handle them on
>>> top of VFS. In other words, glob matching is independent of the
>>> filesystem implementation.
>>>
>>> Bulk is a good thing, but I think it can be handled as well. Best would
>>> be to support it on top of VFS.
>>>
>>> I see VFS as the pure connectivity layer allowing Beam to split and
>>> process data in parallel, not as a complete replacement.
>>>
>>>
>>>
>>> Is it the right direction for the VFS project to support the above
>>> changes? (things that are important to a data parallel processing
>>> system aren't always important for a filesystem implementation.)
>>>
>>>
>>> Bulk will be. Distributed processing will stay in Beam; parallel
>>> processing is important for VFS, IMHO, since it targets plain batch (like
>>> JBatch) as well.
>>>
>>>
>>> For example, Apache Beam relies on bulk match, bulk delete, bulk rename
>>> to be able to do things within FileIO efficiently (Dataflow has had a
>>> bunch of experience where renaming one file at a time, even when using
>>> multiple threads, is quite slow when you have a million files to rename, so
>>> having bulk APIs is important). It has a registration mechanism and
>>> ties into PipelineOptions pretty well. In my opinion the largest deficiency
>>> I see with FileSystems is that we should have used URIs[1] instead of
>>> abstract resource types since we could standardize how URIs are resolved
>>> and how globs work for them, allowing FileSystem authors to implement even
>>> less. The tricky part is how a URI maps onto the file system correctly.
>>>
>>>
>>> Sounds like something unrelated to VFS, right? That said, it is not too
>>> late to use a fallback mechanism when parsing the path, right?
>>>
>>>
>>>
>>> 1: https://issues.apache.org/jira/browse/BEAM-2283
>>>
>>> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On 5 March 2018 at 22:26, "Robert Bradshaw" <rober...@google.com> wrote:
>>>>
>>>> First, let's try to make the terminology abundantly clear, as I for one
>>>> have (I think) misinterpreted what has been proposed.
>>>>
>>>> VfsFileSystem: A subclass of
>>>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>>>
>>>> VfsIO: A replacement for
>>>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>>>> written using Vfs instead of
>>>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>>>
>>>>
>>>>
>>>> Ack
>>>>
>>>>
>>>> Between these two options, VfsFileSystem is the way to go. It will
>>>> allow us to use all our existing File sources/sinks (including all the
>>>> fancy watching/streaming support from FileIO) with any filesystem supported
>>>> by Vfs. Long-term, if VFS is good enough (and we'll be able to do direct
>>>> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
>>>> consider moving to VFS entirely and even removing the layer of indirection.
>>>> Vfs is a filesystem; this is the right level of abstraction to plug into.
>>>> Even if it's lacking in some respects, it may still be worth keeping in
>>>> parallel to the existing FileSystem implementations long-term if it has
>>>> significantly better coverage.
>>>>
>>>>
>>>> Ok
>>>>
>>>>
>>>> On the other hand, a re-implementation of FileIO on top of Vfs seems
>>>> like a lot of duplication of code (and ongoing maintenance cost) and will
>>>> be difficult to build on top of (e.g. the binding of TextIO to FileIO is
>>>> not dynamic like the binding of filesystems).
>>>>
>>>>
>>>> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
>>>> Beam, can make both projects grow from that work and become more mature
>>>> and interoperable with the existing ecosystem (who implements a Beam
>>>> filesystem when providing a new filesystem?). Interestingly, recent Java
>>>> versions have a filesystem abstraction too, but that one is harder to
>>>> evolve for our needs. The high-level goal is to keep it ecosystem-friendly
>>>> and not create yet another one.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Not backing VFS by a filesystem sounds saner, so VfsIO is probably the
>>>>> way to go. It would be a competitor to FileIO and hopefully a
>>>>> replacement in the mid/long term.
>>>>>
>>>>> What about doing the opposite: implementing a VFS filesystem for all
>>>>> the filesystems we support, potentially enriching VFS if needed? Then we
>>>>> can just drop the Beam abstraction, from what I read.
>>>>>
>>>>> On 5 March 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
>>>>>
>>>>>> The terminology is confusing here, since the existing FileIO is a
>>>>>> PTransform. VfsFilesystem would be a better name.
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>>>
>>>>>>>
>>>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <
>>>>>>>> rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <
>>>>>>>>>> chamik...@google.com>:
>>>>>>>>>>
>>>>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/ .
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if we considered this when we originally
>>>>>>>>>>> implemented our own file-system abstraction, but based on a quick
>>>>>>>>>>> look it seems like this is Java only.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, java only
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a Java
>>>>>>>>>>> file-system implementation for VFS?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Can be an option, but when I see the current complexity I'm not
>>>>>>>>>> sure mixing two abstractions would help; maybe just a VfsIO for Java
>>>>>>>>>> users would be good enough - thinking out loud.
>>>>>>>>>>
>>>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>>>> abstraction - which kind of joins your proposal. However, we can
>>>>>>>>>> still make it smooth and easy on the Java side - which will likely
>>>>>>>>>> stay mainstream for some years yet - by using VFS as our Java impl
>>>>>>>>>> instead of reimplementing the full abstraction. This way we keep our
>>>>>>>>>> *API* but we drop the Beam *impl* to just reuse VFS.
>>>>>>>>>>
>>>>>>>>>> PS: for GCS, https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>>>> example of how it can work.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>>>> give us the experience needed to decide if we can move solely to VFS
>>>>>>>>> (for Java at least) for implementation, and possibly API in a future
>>>>>>>>> major release, in the long run.
>>>>>>>>>
>>>>>>>>
>>>>
>>>
>>>
>
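
Editorial illustration (not part of any message in the thread): the point
made above that glob matching is independent of the filesystem
implementation can be sketched in plain Java. The class and method names
below (GlobFilter, globToRegex, match) are hypothetical and are not Beam or
Commons VFS APIs; the sketch assumes '/'-separated names and supports only
'*', '**' and '?'. Any connectivity layer that can list names could feed
it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: filters names by a glob without touching any filesystem.
final class GlobFilter {

    // Translate a simple glob into a regex over '/'-separated names.
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '*') {
                if (i + 1 < glob.length() && glob.charAt(i + 1) == '*') {
                    sb.append(".*");     // '**' may cross directory boundaries
                    i++;                 // consume the second '*'
                } else {
                    sb.append("[^/]*");  // '*' stays within one path segment
                }
            } else if (c == '?') {
                sb.append("[^/]");       // '?' is exactly one non-separator char
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Keep only the names that match the glob; the listing source is irrelevant.
    static List<String> match(String glob, Iterable<String> names) {
        Pattern pattern = globToRegex(glob);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                matched.add(name);
            }
        }
        return matched;
    }
}
```

With names listed by whatever layer provides connectivity,
GlobFilter.match("logs/*.txt", names) keeps only the direct children of
logs/ ending in .txt, while "logs/**.txt" also descends into
subdirectories.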
