@Reuven: yes, this is what I had in mind.

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-06 11:24 GMT+01:00 Reuven Lax <re...@google.com>:

> Part of the point of the current Filesystem class _is_ to handle these
> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
> answer is to keep Filesystem but put Vfs under it (and maybe that will
> eventually allow us to remove some of the current code).
>
> On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> On 6 March 2018 at 01:05, "Lukasz Cwik" <lc...@google.com> wrote:
>>
>> As is, how does VFS improve upon the current FileSystem solution?
>>
>> This is about the ecosystem. I have never seen a Beam filesystem
>> implemented outside Beam, but I have seen tons of VFS users and
>> implementations.
>>
>> How much work is it before VFS supports the Apache Beam use cases (bulk
>> operations, glob support)?
>>
>> I don't expect VFS to handle globs, but I expect Beam to handle them on
>> top of VFS. Put another way, glob matching is independent of the
>> filesystem implementation.
>>
>> Bulk is a good thing, but I think it can be handled as well. Best would
>> be to support it on top of VFS.
>>
>> I see VFS as the pure connectivity layer allowing Beam to split and
>> parallel-process data, not as a complete replacement.
>>
>> Is it the right direction for the VFS project to support the above
>> changes? (Things that are important to a data-parallel processing system
>> aren't always important for a filesystem implementation.)
>>
>> Bulk will be. Distributed processing will stay in Beam; parallel
>> processing is important for VFS IMHO, since it targets plain batch (like
>> jbatch) as well.
>>
>> For example, Apache Beam relies on bulk match, bulk delete, bulk rename
>> to be able to do things within FileIO efficiently (Dataflow has had a
>> bunch of experience where renaming one file at a time, even when using
>> multiple threads, is quite slow when you have a million files to rename,
>> so having bulk APIs is important). It has a registration mechanism and
>> ties into PipelineOptions pretty well. In my opinion the largest
>> deficiency I see with FileSystems is that we should have used URIs[1]
>> instead of abstract resource types, since we could standardize how URIs
>> are resolved and how globs work for them, allowing FileSystem authors to
>> implement even less. The tricky part is how a URI maps onto the file
>> system correctly.
>>
>> Sounds like something unrelated to VFS, right? That said, it is not too
>> late to use a fallback mechanism when parsing the path?
>>
>> 1: https://issues.apache.org/jira/browse/BEAM-2283
>>
>> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>> On 5 March 2018 at 22:26, "Robert Bradshaw" <rober...@google.com> wrote:
>>>
>>> First, let's try to make the terminology abundantly clear, as I for one
>>> have (I think) misinterpreted what has been proposed.
>>>
>>> VfsFileSystem: A subclass of
>>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>>
>>> VfsIO: A replacement for
>>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>>> written using Vfs instead of
>>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>>
>>> Ack
>>>
>>> Between these two options, VfsFileSystem is the way to go.
>>> It will allow us to use all our existing File sources/sinks (including
>>> all the fancy watching/streaming support from FileIO) with any
>>> filesystem supported by Vfs. Long-term, if VFS is good enough (and
>>> we'll be able to do direct experiments of HadoopFileSystem vs.
>>> VfsFileSystem-on-Hadoop) we could consider moving to VFS entirely and
>>> even removing the layer of indirection. Vfs is a filesystem, this is the
>>> right level of abstraction to plug into. Even if it's lacking in some
>>> respects, it may still be worth keeping in parallel to the existing
>>> FileSystem implementations long-term if it has significantly better
>>> coverage.
>>>
>>> Ok
>>>
>>> On the other hand, a re-implementation of FileIO on top of Vfs seems
>>> like a lot of duplication of code (and ongoing maintenance cost) and
>>> will be difficult to build on top of (e.g. the binding of TextIO to
>>> FileIO is not dynamic like the binding of filesystems).
>>>
>>> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
>>> Beam, can make both projects grow from that work and become more mature
>>> and interoperable with the existing ecosystem (who implements a Beam
>>> filesystem when providing a new filesystem?). Interestingly, recent Java
>>> versions have a filesystem abstraction too, but that one is harder to
>>> evolve for our needs. The high-level goal is to keep it ecosystem
>>> friendly and not create yet another one.
>>>
>>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>
>>>> Not backing VFS by a filesystem sounds saner, so VfsIO is probably the
>>>> way to go. It would compete with FileIO and hopefully replace it in
>>>> the mid/long term.
>>>>
>>>> What about doing the opposite: implementing a VFS filesystem for all
>>>> the filesystems we support, potentially enriching VFS if needed? Then
>>>> we can just drop the Beam abstraction, from what I read.
>>>>
>>>> On 5 March 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> The terminology is confusing here, since the existing FileIO is a
>>>>> PTransform. VfsFilesystem would be a better name.
>>>>>
>>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>
>>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> What about a Beam Filesystem impl on top of Vfs as an alternative
>>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>
>>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <chamik...@google.com>:
>>>>>>>>>
>>>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if we considered this when we originally implemented
>>>>>>>>>> our own file-system abstraction, but based on a quick look it
>>>>>>>>>> seems like this is Java only.
>>>>>>>>>
>>>>>>>>> Yes, Java only.
>>>>>>>>>
>>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a
>>>>>>>>>> Java file-system implementation for VFS?
>>>>>>>>>
>>>>>>>>> This could be an option, but when I see the current complexity I'm
>>>>>>>>> not sure mixing two abstractions would help; maybe just a VfsIO for
>>>>>>>>> Java users would be good enough (thinking out loud).
>>>>>>>>>
>>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>>> abstraction, which kind of joins your proposal.
>>>>>>>>> However, we can still make it smooth and easy on the Java side,
>>>>>>>>> which will likely stay mainstream for some years yet, by using VFS
>>>>>>>>> as our Java impl instead of reimplementing the full abstraction?
>>>>>>>>> This way we keep our *API* but we drop the Beam *impl* to just
>>>>>>>>> reuse VFS.
>>>>>>>>>
>>>>>>>>> PS: for GCS, https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>>> example of how it can work.
>>>>>>>>
>>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>>> give us the experience needed to decide if we can move solely to VFS
>>>>>>>> (for Java at least) for implementation, and possibly API in a future
>>>>>>>> major release, in the long run.