@Reuven: yes, this is what I had in mind.

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-06 11:24 GMT+01:00 Reuven Lax <re...@google.com>:

> Part of the point of the current Filesystem class _is_ to handle these
> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
> answer is to keep Filesystem but put Vfs under it (and maybe that will
> eventually allow us to remove some of the current code).
>
>
> On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> On 6 March 2018 at 01:05, "Lukasz Cwik" <lc...@google.com> wrote:
>>
>> As is, how does VFS improve upon the current FileSystem solution?
>>
>>
>> This is about the ecosystem. I have never seen a Beam filesystem implemented
>> outside Beam, but I have seen plenty of VFS users and implementations.
>>
>>
>>
>> How much work is it before VFS supports the Apache Beam use cases (bulk
>> operations, glob support)?
>>
>>
>> I don't expect VFS to handle globs, but I expect Beam to handle them on top
>> of VFS. Put differently, glob matching is independent of the filesystem
>> implementation.
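To illustrate the point that glob matching can sit above the connectivity layer, here is a minimal Java sketch. It is not Beam or VFS code: `GlobOverListing` and its `match` method are hypothetical names, and the listing is just a list of path strings that any backend could have produced. It assumes the JDK's `PathMatcher` glob syntax is acceptable, which is not necessarily the glob dialect Beam uses.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: filters a flat listing of relative paths with a glob. */
final class GlobOverListing {

  /**
   * Returns the entries of {@code listing} that match {@code glob}.
   * The listing may come from any backend; only the path strings are inspected,
   * so the matching logic does not depend on the filesystem implementation.
   */
  static List<String> match(String glob, List<String> listing) {
    // JDK glob syntax: "*" stays within one directory level, "**" crosses levels.
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    return listing.stream()
        .filter(p -> matcher.matches(Paths.get(p)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> listing = Arrays.asList("logs/a.txt", "logs/b.csv", "logs/sub/c.txt");
    System.out.println(match("logs/*.txt", listing));
    System.out.println(match("logs/**.txt", listing));
  }
}
```

The backend only has to provide a listing; the glob semantics live in one place and stay the same for every filesystem.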
>>
>> Bulk operations are a good thing, but I think they can be handled as well.
>> The best option would be to support them on top of VFS.
>>
>> I see VFS as the pure connectivity layer that lets Beam split and process
>> data in parallel, not as a complete replacement.
>>
>>
>>
>> Is it the right direction for the VFS project to support the above
>> changes? (things that are important to a data parallel processing system
>> aren't always important for a filesystem implementation.)
>>
>>
>> Bulk will be. Distributed processing will stay in Beam; parallel
>> processing is important for VFS, IMHO, since it also targets plain batch
>> (like JBatch).
>>
>>
>> For example, Apache Beam relies on bulk match, bulk delete, bulk rename
>> to be able to do things within FileIO efficiently (Dataflow has had a
>> bunch of experience where renaming one file at a time even when using
>> multiple threads is quite slow when you have a million files to rename so
>> having bulk APIs is important). It has a registration mechanism and ties
>> into PipelineOptions pretty well. In my opinion the largest deficiency I
>> see with FileSystems is that we should have used URIs[1] instead of
>> abstract resource types since we could standardize how URIs are resolved
>> and how globs work for them, allowing FileSystem authors to implement even
>> less. The tricky part is how a URI maps onto the file system correctly.
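A rough Java sketch of why a list-based bulk API matters: when the caller hands over the whole list, a backend is free to batch requests instead of paying one round trip per file. Everything here is hypothetical and invented for illustration (`BulkCapableStore`, `BatchingStore`, the round-trip counter); it uses neither Beam's FileSystem nor VFS, and the "backend" is a toy that only records what it was asked to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical connector interface; not Beam's or VFS's actual API. */
interface BulkCapableStore {
  /** Renames a single resource. */
  void rename(String from, String to);

  /**
   * Bulk entry point. The default falls back to one call per file;
   * a backend with a native batch endpoint can override it.
   */
  default void renameAll(List<String> from, List<String> to) {
    if (from.size() != to.size()) {
      throw new IllegalArgumentException("source and destination lists differ in size");
    }
    for (int i = 0; i < from.size(); i++) {
      rename(from.get(i), to.get(i));
    }
  }
}

/** Toy backend that groups renames into batches and counts simulated round trips. */
final class BatchingStore implements BulkCapableStore {
  private final int batchSize;
  int roundTrips = 0;
  final List<String> renamed = new ArrayList<>();

  BatchingStore(int batchSize) {
    this.batchSize = batchSize;
  }

  @Override
  public void rename(String from, String to) {
    roundTrips++; // one simulated request per file
    renamed.add(from + "->" + to);
  }

  @Override
  public void renameAll(List<String> from, List<String> to) {
    if (from.size() != to.size()) {
      throw new IllegalArgumentException("source and destination lists differ in size");
    }
    for (int start = 0; start < from.size(); start += batchSize) {
      int end = Math.min(start + batchSize, from.size());
      roundTrips++; // one simulated request per batch
      for (int i = start; i < end; i++) {
        renamed.add(from.get(i) + "->" + to.get(i));
      }
    }
  }

  public static void main(String[] args) {
    BatchingStore store = new BatchingStore(2);
    store.renameAll(Arrays.asList("a", "b", "c"), Arrays.asList("x", "y", "z"));
    System.out.println(store.renamed + " in " + store.roundTrips + " round trips");
  }
}
```

With the default method, a connector that only knows single-file rename still satisfies the bulk contract, which is one way bulk support could be layered on top of a simpler filesystem library.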
>>
>>
>> Sounds like something unrelated to VFS, right? That said, it is not too
>> late to use a fallback mechanism when parsing the path, is it?
>>
>>
>>
>> 1: https://issues.apache.org/jira/browse/BEAM-2283
>>
>> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <rmannibu...@gmail.com
>> > wrote:
>>
>>>
>>>
>>> On 5 March 2018 at 22:26, "Robert Bradshaw" <rober...@google.com> wrote:
>>>
>>> First, let's try to make the terminology abundantly clear, as I for one
>>> have (I think) misinterpreted what has been proposed.
>>>
>>> VfsFileSystem: A subclass of
>>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>>
>>> VfsIO: A replacement for
>>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>>> written using Vfs instead of
>>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>>
>>>
>>>
>>> Ack
>>>
>>>
>>> Between these two options, VfsFileSystem is the way to go. It will allow
>>> us to use all our existing File sources/sinks (including all the fancy
>>> watching/streaming support from FileIO) with any filesystem supported by
>>> VFS. Long-term, if VFS is good enough (and we'll be able to do direct
>>> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
>>> consider moving to VFS entirely and even removing the layer of indirection.
>>> VFS is a filesystem, so this is the right level of abstraction to plug into.
>>> Even if it's lacking in some respects, it may still be worth keeping in
>>> parallel to the existing FileSystem implementations long-term if it has
>>> significantly better coverage.
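The shape of the VfsFileSystem option can be sketched in Java as a thin adapter: the Beam-side abstraction stays, and a VFS-style library sits underneath it as the connectivity layer. This is only an illustration under loud assumptions. `SketchFileSystem` and `SketchVfs` are hypothetical stand-ins, not the real Beam `FileSystem` class or the Commons VFS API, and the in-memory backend exists only so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a Beam-style filesystem abstraction (bulk-oriented). */
abstract class SketchFileSystem {
  abstract String scheme();
  abstract List<String> match(String prefix);
  abstract void delete(List<String> paths);
}

/** Hypothetical stand-in for a VFS-style connectivity layer (single-file operations). */
interface SketchVfs {
  List<String> list();
  boolean delete(String path);
}

/** Adapter: keeps the bulk-oriented API and delegates the actual I/O to the VFS layer. */
final class VfsBackedFileSystem extends SketchFileSystem {
  private final String scheme;
  private final SketchVfs vfs;

  VfsBackedFileSystem(String scheme, SketchVfs vfs) {
    this.scheme = scheme;
    this.vfs = vfs;
  }

  @Override
  String scheme() {
    return scheme;
  }

  @Override
  List<String> match(String prefix) {
    // Matching is done on top of the raw listing (prefix match keeps the sketch short).
    List<String> result = new ArrayList<>();
    for (String path : vfs.list()) {
      if (path.startsWith(prefix)) {
        result.add(path);
      }
    }
    return result;
  }

  @Override
  void delete(List<String> paths) {
    // Bulk delete expressed as a loop over the single-file primitive.
    for (String path : paths) {
      vfs.delete(path);
    }
  }
}

/** In-memory backend so the sketch is runnable without any external library. */
final class InMemoryVfs implements SketchVfs {
  private final List<String> files = new ArrayList<>();

  InMemoryVfs(List<String> initial) {
    files.addAll(initial);
  }

  @Override
  public List<String> list() {
    return new ArrayList<>(files);
  }

  @Override
  public boolean delete(String path) {
    return files.remove(path);
  }

  public static void main(String[] args) {
    InMemoryVfs vfs = new InMemoryVfs(Arrays.asList("ram://a/1", "ram://a/2", "ram://b/1"));
    SketchFileSystem fs = new VfsBackedFileSystem("ram", vfs);
    fs.delete(fs.match("ram://a/"));
    System.out.println(vfs.list());
  }
}
```

Existing sources and sinks would keep talking to the upper abstraction, which is why this option avoids re-implementing FileIO; whether the loop-based bulk operations are fast enough is exactly the kind of thing the direct experiments mentioned above would show.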
>>>
>>>
>>> Ok
>>>
>>>
>>> On the other hand, a re-implementation of FileIO on top of Vfs seems
>>> like a lot of duplication of code (and ongoing maintenance cost) and will
>>> be difficult to build on top of (e.g. the binding of TextIO to FileIO is
>>> not dynamic like the binding of filesystems).
>>>
>>>
>>> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
>>> Beam, can help both projects grow from that work and become more mature and
>>> interoperable with the existing ecosystem (who implements a Beam filesystem
>>> when providing a new filesystem?). Interestingly, recent Java versions
>>> have a filesystem abstraction too, but that one is harder to evolve
>>> for our needs. The high-level goal is to stay ecosystem friendly and not
>>> create yet another one.
>>>
>>>
>>>
>>>
>>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>>> wrote:
>>>
>>>> Not backing VFS with a filesystem sounds saner, so VfsIO is probably the
>>>> way to go. It would be a competitor to FileIO and, hopefully, its
>>>> replacement in the mid/long term.
>>>>
>>>> What about doing the opposite: implementing a VFS filesystem for all
>>>> the filesystems we support, potentially enriching VFS if needed? Then we
>>>> can just drop the Beam abstraction, from what I read.
>>>>
>>>> On 5 March 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> The terminology is confusing here, since the existing FileIO is a
>>>>> PTransform. VfsFilesystem would be a better name.
>>>>>
>>>>>
>>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>>
>>>>>>
>>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>>
>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <
>>>>>>>>> chamik...@google.com>:
>>>>>>>>>
>>>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if we considered this when we originally implemented
>>>>>>>>>> our own file-system abstraction, but based on a quick look it seems
>>>>>>>>>> like this is Java only.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, java only
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a Java
>>>>>>>>>> file-system implementation for VFS?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It can be an option, but when I see the current complexity I'm not
>>>>>>>>> sure mixing two abstractions would help; maybe just a VfsIO for Java
>>>>>>>>> users would be good enough - thinking out loud.
>>>>>>>>>
>>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>>> abstraction, which more or less matches your proposal. However,
>>>>>>>>> couldn't we still make it smooth and easy on the Java side, which
>>>>>>>>> will likely stay mainstream for some years yet, by using VFS as
>>>>>>>>> our Java implementation instead of reimplementing the full
>>>>>>>>> abstraction? This way we keep our *API* but drop the Beam *impl*
>>>>>>>>> and just reuse VFS.
>>>>>>>>>
>>>>>>>>> PS: for GCS, https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>>> example of how it can work.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>>> give us the experience needed to decide if we can move solely to VFS
>>>>>>>> (for Java at least) for implementation, and possibly API in a future
>>>>>>>> major release, in the long run.
>>>>>>>>
>>>>>>>
>>>
>>
>>
