Just to share a bit more than a ticket, here is a bootstrap implementation: https://github.com/apache/beam/pull/4803

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github
<https://github.com/rmannibucau> | LinkedIn
<https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-06 12:42 GMT+01:00 Reuven Lax <[email protected]>:

> Cool. Then for now we should create a separate Vfs-backed Filesystem impl.
> Once Vfs supports all we need, I think we can consider keeping only that.
>
> Keep in mind that the bulk operations Luke mentioned translate to native
> bulk operations for Gcs at least (BatchRequest is part of the Gcs API). I'm
> not entirely sure whether HDFS natively supports this or not. This implies
> that we would need some way of expressing bulk operations through Vfs.
>
> On Tue, Mar 6, 2018 at 2:27 AM Romain Manni-Bucau <[email protected]>
> wrote:
>
>> @Reuven: this was what I had in mind yes.
>>
>> 2018-03-06 11:24 GMT+01:00 Reuven Lax <[email protected]>:
>>
>>> Part of the point of the current Filesystem class _is_ to handle these
>>> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
>>> answer is to keep Filesystem but put Vfs under it (and maybe that will
>>> eventually allow us to remove some of the current code).
>>>
>>> On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <
>>> [email protected]> wrote:
>>>
>>>> On 6 March 2018 at 01:05, "Lukasz Cwik" <[email protected]> wrote:
>>>>
>>>> As is, how does VFS improve upon the current FileSystem solution?
>>>>
>>>>
>>>> This is about the ecosystem.
>>>> I never saw a beam fs implemented outside
>>>> beam, but I saw tons of vfs users and implementations.
>>>>
>>>>
>>>> How much work is it before VFS supports the Apache Beam usecases (bulk
>>>> operations, glob support)?
>>>>
>>>>
>>>> I don't expect vfs to handle globs, but I expect beam to handle them on
>>>> top of vfs. Said otherwise, glob matching is independent of the fs impl.
>>>>
>>>> Bulk is a good thing, and I think it can be handled as well. Best
>>>> would be to support it on top of vfs.
>>>>
>>>> I see vfs as the pure connectivity layer allowing beam to split and
>>>> parallel process data and not as a complete replacement.
>>>>
>>>>
>>>> Is it the right direction for the VFS project to support the above
>>>> changes? (things that are important to a data parallel processing
>>>> system aren't always important for a filesystem implementation.)
>>>>
>>>>
>>>> Bulk will be. Distributed processing will stay in beam; parallel
>>>> processing is important for vfs IMHO since it targets plain batch (like
>>>> jbatch) as well.
>>>>
>>>>
>>>> For example, Apache Beam relies on bulk match, bulk delete, bulk rename
>>>> to be able to do things within FileIO efficiently (Dataflow has had a
>>>> bunch of experience where renaming one file at a time even when using
>>>> multiple threads is quite slow when you have a million files to rename so
>>>> having bulk APIs is important). It has a registration mechanism and
>>>> ties into PipelineOptions pretty well. In my opinion the largest deficiency
>>>> I see with FileSystems is that we should have used URIs[1] instead of
>>>> abstract resource types since we could standardize how URIs are resolved
>>>> and how globs work for them, allowing FileSystem authors to implement even
>>>> less. The tricky part is how does a URI map onto the file system correctly.
>>>>
>>>>
>>>> Sounds like something unrelated to vfs, right? That said, isn't it still
>>>> possible to use a fallback mechanism when parsing the path?
>>>>
>>>> 1: https://issues.apache.org/jira/browse/BEAM-2283
>>>>
>>>> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <
>>>> [email protected]> wrote:
>>>>
>>>>> On 5 March 2018 at 22:26, "Robert Bradshaw" <[email protected]> wrote:
>>>>>
>>>>> First, let's try to make the terminology abundantly clear, as I for
>>>>> one have (I think) misinterpreted what has been proposed.
>>>>>
>>>>> VfsFileSystem: A subclass of
>>>>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>>>>
>>>>> VfsIO: A replacement for
>>>>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>>>>> written using Vfs instead of
>>>>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>>>>
>>>>>
>>>>> Ack
>>>>>
>>>>>
>>>>> Between these two options, VfsFileSystem is the way to go. It will
>>>>> allow us to use all our existing File sources/sinks (including all the
>>>>> fancy watching/streaming support from FileIO) with any filesystem
>>>>> supported by Vfs. Long-term, if VFS is good enough (and we'll be able to
>>>>> do direct experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop)
>>>>> we could consider moving to VFS entirely and even removing the layer of
>>>>> indirection.
>>>>> Vfs is a filesystem, this is the right level of abstraction to plug into.
>>>>> Even if it's lacking in some respects, it may still be worth keeping in
>>>>> parallel to the existing FileSystem implementations long-term if it has
>>>>> significantly better coverage.
>>>>>
>>>>> Ok
>>>>>
>>>>>
>>>>> On the other hand, a re-implementation of FileIO on top of Vfs seems
>>>>> like a lot of duplication of code (and ongoing maintenance cost) and will
>>>>> be difficult to build on top of (e.g. the binding of TextIO to FileIO is
>>>>> not dynamic like the binding of filesystems).
>>>>>
>>>>>
>>>>> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
>>>>> beam, can make both projects grow from that work and become more mature
>>>>> and interoperable with the existing ecosystem (who implements a beam
>>>>> filesystem when providing a new filesystem?). Interestingly, recent java
>>>>> versions have a filesystem abstraction too, but that one is harder to
>>>>> evolve for our needs. The high-level goal is to keep it ecosystem
>>>>> friendly and not create yet another one.
>>>>>
>>>>>
>>>>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Not backing vfs by a filesystem sounds saner so VfsIO is probably the
>>>>>> way to go. It would be a competitor to FileIO and hopefully its
>>>>>> replacement in the mid/long term.
>>>>>>
>>>>>> What about doing the opposite: implementing a vfs filesystem for all
>>>>>> the fs we support, potentially enriching vfs if needed? Then we can
>>>>>> just drop the beam abstraction, from what I read.
>>>>>>
>>>>>> On 5 March 2018 at 20:49, "Reuven Lax" <[email protected]> wrote:
>>>>>>
>>>>>>> Terminology is confusing here, since the existing FileIO is a
>>>>>>> PTransform. VfsFilesystem would be a better name.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>>>>
>>>>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <
>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> I assume you mean
>>>>>>>>>>>> https://commons.apache.org/proper/commons-vfs/.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if we considered this when we originally
>>>>>>>>>>>> implemented our own file-system abstraction but based on a quick
>>>>>>>>>>>> look seems like this is Java only.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, java only
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a
>>>>>>>>>>>> Java file-system implementation for VFS?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Can be an option but when I see the current complexity I'm not
>>>>>>>>>>> sure mixing 2 abstractions would help, maybe just a VfsIO for java
>>>>>>>>>>> users would be good enough - thinking out loud.
>>>>>>>>>>>
>>>>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>>>>> abstraction - which kind of matches your proposal. However we can
>>>>>>>>>>> still make it smooth and easy on the java side - which
>>>>>>>>>>> will likely stay mainstream for some years yet - using vfs as
>>>>>>>>>>> our java impl instead of reimplementing the full abstraction? This
>>>>>>>>>>> way we keep our *API* but we drop beam *impl* to just reuse VFS.
>>>>>>>>>>>
>>>>>>>>>>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>>>>> example of how it can work.
>>>>>>>>>>
>>>>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>>>>> give us the experience needed to decide if we can move solely to
>>>>>>>>>> VFS (for Java at least) for implementation, and possibly API in a
>>>>>>>>>> future major release, in the long run.
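
For reference, the point made earlier in the thread that glob matching is independent of the fs impl can be sketched in plain Java. This is only an illustration under stated assumptions: `Lister` is a hypothetical stand-in for a connectivity-only layer (it is not a Beam or Commons VFS type), and evaluating the glob with the JDK's `PathMatcher` is a choice made for the sketch, not how Beam's FileSystems resolves patterns.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlobOverListing {

    /** Hypothetical connectivity layer: it only knows how to list paths. */
    interface Lister {
        List<String> list();
    }

    /**
     * Filters whatever the lister returns with a glob pattern.
     * No backend-specific code is involved; the backend only lists.
     */
    static List<String> match(Lister lister, String glob) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String candidate : lister.list()) {
            if (matcher.matches(Paths.get(candidate))) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // An in-memory "backend" used purely for the demonstration.
        Lister inMemory =
            () -> Arrays.asList("data/a.csv", "data/b.txt", "data/sub/c.csv");
        // "*" does not cross directory boundaries, so only data/a.csv matches.
        System.out.println(match(inMemory, "data/*.csv")); // prints [data/a.csv]
    }
}
```

Any backend that can list its entries could be plugged in behind `Lister`. Bulk operations are a different matter, since they need backend support to be efficient, as noted above for Gcs.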
