Created https://issues.apache.org/jira/browse/BEAM-3786 to track the discussion (without putting too much detail in the ticket for now).

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-06 11:27 GMT+01:00 Romain Manni-Bucau <[email protected]>:

> @Reuven: this was what I had in mind, yes.
>
> 2018-03-06 11:24 GMT+01:00 Reuven Lax <[email protected]>:
>
>> Part of the point of the current Filesystem class _is_ to handle these
>> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
>> answer is to keep Filesystem but put Vfs under it (and maybe that will
>> eventually allow us to remove some of the current code).
>>
>>
>> On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <[email protected]>
>> wrote:
>>
>>>
>>> On 6 March 2018 at 01:05, "Lukasz Cwik" <[email protected]> wrote:
>>>
>>> As is, how does VFS improve upon the current FileSystem solution?
>>>
>>>
>>> This is about the ecosystem. I have never seen a beam fs implemented
>>> outside beam, but I have seen tons of vfs users and implementations.
>>>
>>>
>>> How much work is it before VFS supports the Apache Beam use cases (bulk
>>> operations, glob support)?
>>>
>>>
>>> I don't expect vfs to handle globs, but I expect beam to handle them on
>>> top of vfs. In other words, glob matching is independent of the fs impl.
>>>
>>> Bulk is a good thing, but I think it can be handled as well. Best
>>> would be to support it on top of vfs.
>>>
>>> I see vfs as the pure connectivity layer allowing beam to split and
>>> parallel process data, and not as a complete replacement.
>>>
>>>
>>> Is it the right direction for the VFS project to support the above
>>> changes? (things that are important to a data parallel processing
>>> system aren't always important for a filesystem implementation.)
>>>
>>>
>>> Bulk will be. Distributed processing will stay in beam; parallel
>>> processing is important for vfs IMHO since it targets plain batch (like
>>> jbatch) as well.
>>>
>>>
>>> For example, Apache Beam relies on bulk match, bulk delete, bulk rename
>>> to be able to do things within FileIO efficiently (Dataflow has had a
>>> bunch of experience where renaming one file at a time even when using
>>> multiple threads is quite slow when you have a million files to rename so
>>> having bulk APIs is important). It has a registration mechanism and
>>> ties into PipelineOptions pretty well. In my opinion the largest deficiency
>>> I see with FileSystems is that we should have used URIs[1] instead of
>>> abstract resource types since we could standardize how URIs are resolved
>>> and how globs work for them, allowing FileSystem authors to implement even
>>> less. The tricky part is how does a URI map onto the file system correctly.
>>>
>>>
>>> Sounds like something unrelated to vfs, right? That said, it is not too
>>> late to use a fallback mechanism when parsing the path?
>>>
>>>
>>> 1: https://issues.apache.org/jira/browse/BEAM-2283
>>>
>>> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <
>>> [email protected]> wrote:
>>>
>>>>
>>>> On 5 March 2018 at 22:26, "Robert Bradshaw" <[email protected]> wrote:
>>>>
>>>> First, let's try to make the terminology abundantly clear, as I for one
>>>> have (I think) misinterpreted what has been proposed.
>>>>
>>>> VfsFileSystem: A subclass of
>>>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>>>
>>>> VfsIO: A replacement for
>>>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>>>> written using Vfs instead of
>>>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>>>
>>>>
>>>> Ack
>>>>
>>>>
>>>> Between these two options, VfsFileSystem is the way to go. It will
>>>> allow us to use all our existing File sources/sinks (including all the
>>>> fancy watching/streaming support from FileIO) with any filesystem supported
>>>> by Vfs. Long-term, if VFS is good enough (and we'll be able to do direct
>>>> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
>>>> consider moving to VFS entirely and even removing the layer of indirection.
>>>> Vfs is a filesystem, this is the right level of abstraction to plug into.
>>>> Even if it's lacking in some respects, it may still be worth keeping in
>>>> parallel to the existing FileSystem implementations long-term if it has
>>>> significantly better coverage.
>>>>
>>>>
>>>> Ok
>>>>
>>>>
>>>> On the other hand, a re-implementation of FileIO on top of Vfs seems
>>>> like a lot of duplication of code (and ongoing maintenance cost) and will
>>>> be difficult to build on top of (e.g. the binding of TextIO to FileIO is
>>>> not dynamic like the binding of filesystems).
>>>>
>>>>
>>>> Well, it shouldn't. Let me clarify my view: we - as the ASF and not just
>>>> beam - can make both projects grow from that work and become more mature
>>>> and interoperable with the existing ecosystem (who implements a beam
>>>> filesystem when providing a new filesystem?).
>>>> An interesting thing is that recent java versions
>>>> have a filesystem abstraction too, but this one is harder to evolve
>>>> for our needs. The high-level goal is to keep it ecosystem friendly and
>>>> not create yet another one.
>>>>
>>>>
>>>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <
>>>> [email protected]> wrote:
>>>>
>>>>> Not backing vfs by a filesystem sounds saner so VfsIO is probably the
>>>>> way to go. It would be a FileIO competitor and hopefully replacement in
>>>>> the mid/long term.
>>>>>
>>>>> What about doing the opposite: implementing a vfs filesystem for all
>>>>> the fs we support, potentially enriching vfs if needed? Then we can just
>>>>> drop the beam abstraction, from what I read.
>>>>>
>>>>> On 5 March 2018 at 20:49, "Reuven Lax" <[email protected]> wrote:
>>>>>
>>>>>> terminology is confusing here, since the existing FileIO is a
>>>>>> PTransform. VfsFilesystem would be a better name.
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <[email protected]> wrote:
>>>>>>>
>>>>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>>>
>>>>>>>
>>>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <
>>>>>>>>>> [email protected]>:
>>>>>>>>>>
>>>>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/ .
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if we considered this when we originally
>>>>>>>>>>> implemented our own file-system abstraction but based on a quick
>>>>>>>>>>> look seems like this is Java only.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, java only
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a Java
>>>>>>>>>>> file-system implementation for VFS?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Can be an option but when I see the current complexity I'm not
>>>>>>>>>> sure mixing 2 abstractions would help, maybe just a VfsIO for java
>>>>>>>>>> users would be good enough - thinking out loud.
>>>>>>>>>>
>>>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>>>> abstraction - which kind of joins your proposal. However we can still
>>>>>>>>>> make it smooth and easy on the java side - which
>>>>>>>>>> will likely stay mainstream for some years still - using vfs as
>>>>>>>>>> our java impl instead of reimplementing the full abstraction? This
>>>>>>>>>> way we keep our *API* but we drop beam *impl* to just reuse VFS.
>>>>>>>>>>
>>>>>>>>>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>>>> example on how it can work.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>>>> give us the experience needed to decide if we can move solely to VFS
>>>>>>>>> (for Java at least) for implementation, and possibly API in a future
>>>>>>>>> major release, in the long run.
