To share a bit more than a ticket, here is a bootstrap implementation:
https://github.com/apache/beam/pull/4803
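
One point made in the quoted replies below is that glob matching does not have to live in the filesystem implementation. The following is only a minimal sketch of that idea, not code from the PR. It assumes the backend (VFS or anything else) can already return a flat list of path strings. The class GlobOverListing and its match method are hypothetical names; the only real API used is the JDK's java.nio.file.PathMatcher.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class GlobOverListing {
    // Hypothetical helper: filters paths that some backend listing call
    // has already returned. It knows nothing about the backend itself.
    public static List<String> match(String glob, List<String> listedPaths) {
        // JDK glob syntax; "*" does not cross directory separators.
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return listedPaths.stream()
                .filter(p -> matcher.matches(Paths.get(p)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the result of a backend "list children" call.
        List<String> listed = Arrays.asList(
                "out/part-0.txt", "out/part-1.txt", "out/_SUCCESS");
        System.out.println(match("out/part-*.txt", listed));
    }
}
```

A real integration would still need a native bulk or prefix listing to stay efficient on object stores; this sketch only shows that the matching step itself is backend independent.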


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-06 12:42 GMT+01:00 Reuven Lax <re...@google.com>:

> Cool. Then for now we should create a separate Vfs-backed Filesystem impl.
> Once Vfs supports all we need, I think we can consider keeping only that.
>
> Keep in mind that the bulk operations Luke mentioned translate to native
> bulk operations for Gcs at least (BatchRequest is part of the Gcs API). I'm
> not entirely sure whether HDFS natively supports this or not. This implies
> that we would need some way of expressing bulk operations through Vfs.
>
>
> On Tue, Mar 6, 2018 at 2:27 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> @Reuven: this was what I had in mind yes.
>>
>>
>>
>> 2018-03-06 11:24 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> Part of the point of the current Filesystem class _is_ to handle these
>>> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
>>> answer is to keep Filesystem but put Vfs under it (and maybe that will
>>> eventually allow us to remove some of the current code).
>>>
>>>
>>> On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Mar 6, 2018 at 01:05, "Lukasz Cwik" <lc...@google.com> wrote:
>>>>
>>>> As is, how does VFS improve upon the current FileSystem solution?
>>>>
>>>>
>>>> This is about the ecosystem. I have never seen a Beam filesystem
>>>> implemented outside Beam, but I have seen tons of VFS users and
>>>> implementations.
>>>>
>>>>
>>>>
>>>> How much work is it before VFS supports the Apache Beam usecases  (bulk
>>>> operations, glob support)?
>>>>
>>>>
>>>> I don't expect VFS to handle globs, but I expect Beam to handle them on
>>>> top of VFS. Put differently, glob matching is independent of the fs impl.
>>>>
>>>> Bulk is a good thing, and I think it can be handled as well. Best
>>>> would be to support it on top of VFS.
>>>>
>>>> I see VFS as the pure connectivity layer allowing Beam to split and
>>>> process data in parallel, not as a complete replacement.
>>>>
>>>>
>>>>
>>>> Is it the right direction for the VFS project to support the above
>>>> changes? (things that are important to a data parallel processing
>>>> system aren't always important for a filesystem implementation.)
>>>>
>>>>
>>>> Bulk will be. Distributed processing will stay in Beam; parallel
>>>> processing is important for VFS IMHO since it targets plain batch (like
>>>> JBatch) as well.
>>>>
>>>>
>>>> For example, Apache Beam relies on bulk match, bulk delete, bulk rename
>>>> to be able to do things within FileIO efficiently (Dataflow has had a
>>>> bunch of experience where renaming one file at a time even when using
>>>> multiple threads is quite slow when you have a million files to rename so
>>>> having bulk APIs is important). It has a registration mechanism and
>>>> ties into PipelineOptions pretty well. In my opinion the largest deficiency
>>>> I see with FileSystems is that we should have used URIs[1] instead of
>>>> abstract resource types since we could standardize how URIs are resolved
>>>> and how globs work for them, allowing FileSystem authors to implement even
>>>> less. The tricky part is how does a URI map onto the file system correctly.
>>>>
>>>>
>>>> Sounds like something unrelated to VFS, right? That said, it is not too
>>>> late to use a fallback mechanism when parsing the path?
>>>>
>>>>
>>>>
>>>> 1: https://issues.apache.org/jira/browse/BEAM-2283
>>>>
>>>> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mar 5, 2018 at 22:26, "Robert Bradshaw" <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>> First, let's try to make the terminology abundantly clear, as I for
>>>>> one have (I think) misinterpreted what has been proposed.
>>>>>
>>>>> VfsFileSystem: A subclass of
>>>>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>>>>
>>>>> VfsIO: A replacement for
>>>>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>>>>> written using Vfs instead of
>>>>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>>>>
>>>>>
>>>>>
>>>>> Ack
>>>>>
>>>>>
>>>>> Between these two options, VfsFileSystem is the way to go. It will
>>>>> allow us to use all our existing File sources/sinks (including all the
>>>>> fancy watching/streaming support from FileIO) with any filesystem
>>>>> supported by Vfs. Long-term, if VFS is good enough (and we'll be able to
>>>>> do direct experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop),
>>>>> we could consider moving to VFS entirely and even removing the layer of
>>>>> indirection. Vfs is a filesystem, so this is the right level of
>>>>> abstraction to plug into. Even if it's lacking in some respects, it may
>>>>> still be worth keeping in parallel to the existing FileSystem
>>>>> implementations long-term if it has significantly better coverage.
>>>>>
>>>>>
>>>>> Ok
>>>>>
>>>>>
>>>>> On the other hand, a re-implementation of FileIO on top of Vfs seems
>>>>> like a lot of duplication of code (and ongoing maintenance cost) and will
>>>>> be difficult to build on top of (e.g. the binding of TextIO to FileIO is
>>>>> not dynamic like the binding of filesystems).
>>>>>
>>>>>
>>>>> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
>>>>> Beam, can make both projects grow from that work and become more mature
>>>>> and interoperable with the existing ecosystem (who implements a Beam
>>>>> filesystem when providing a new filesystem?). Interestingly, recent Java
>>>>> versions have a filesystem abstraction too, but that one is harder to
>>>>> evolve for our needs. The high-level goal is to stay ecosystem friendly
>>>>> and not create yet another one.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Not backing VFS by a filesystem sounds saner, so VfsIO is probably the
>>>>>> way to go. It would be a competitor to FileIO and hopefully its
>>>>>> replacement in the mid/long term.
>>>>>>
>>>>>> What about doing the opposite: implementing a VFS filesystem for all
>>>>>> the filesystems we support, potentially enriching VFS if needed? Then we
>>>>>> can just drop the Beam abstraction, from what I read.
>>>>>>
>>>>>> On Mar 5, 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
>>>>>>
>>>>>>> The terminology is confusing here, since the existing FileIO is a
>>>>>>> PTransform. VfsFilesystem would be a better name.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <
>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <
>>>>>>>>>>> chamik...@google.com>:
>>>>>>>>>>>
>>>>>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if we considered this when we originally
>>>>>>>>>>>> implemented our own file-system abstraction, but based on a quick
>>>>>>>>>>>> look it seems like this is Java only.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, java only
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a
>>>>>>>>>>>> Java file-system implementation for VFS?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Can be an option, but when I see the current complexity I'm not
>>>>>>>>>>> sure mixing 2 abstractions would help; maybe just a VfsIO for Java
>>>>>>>>>>> users would be good enough - thinking out loud.
>>>>>>>>>>>
>>>>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>>>>> abstraction - which kind of joins your proposal. However, we can
>>>>>>>>>>> still make it smooth and easy on the Java side - which will likely
>>>>>>>>>>> stay mainstream for some years still - by using VFS as our Java
>>>>>>>>>>> impl instead of reimplementing the full abstraction. This way we
>>>>>>>>>>> keep our *API* but we drop the Beam *impl* to just reuse VFS.
>>>>>>>>>>>
>>>>>>>>>>> PS: for GCS, https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>>>>> example of how it can work.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>>>>> give us the experience needed to decide if we can move solely to
>>>>>>>>>> VFS (for Java at least) for implementation, and possibly API in a
>>>>>>>>>> future major release, in the long run.
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>>>
>>
