+1 for the discussion and tracking.

Regards
JB

On 03/06/2018 12:07 PM, Romain Manni-Bucau wrote:
> Created https://issues.apache.org/jira/browse/BEAM-3786 to track the discussion (without putting too many details in the ticket for now).
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-03-06 11:27 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com>:
>
> > @Reuven: this was what I had in mind, yes.
> >
> > 2018-03-06 11:24 GMT+01:00 Reuven Lax <re...@google.com>:
> >
> > > Part of the point of the current Filesystem class _is_ to handle these things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right answer is to keep Filesystem but put Vfs under it (and maybe that will eventually allow us to remove some of the current code).
> > >
> > > On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > >
> > > > On 6 March 2018 at 01:05, "Lukasz Cwik" <lc...@google.com> wrote:
> > > >
> > > > > As is, how does VFS improve upon the current FileSystem solution?
> > > >
> > > > This is about the ecosystem. I have never seen a beam fs implemented outside beam, but I have seen tons of vfs users and implementations.
> > > >
> > > > > How much work is it before VFS supports the Apache Beam use cases (bulk operations, glob support)?
> > > >
> > > > I don't expect vfs to handle globs, but I expect beam to handle them on top of vfs. Said otherwise, glob matching is independent of the fs impl.
> > > >
> > > > Bulk is a good thing, but I think it can be handled as well. Best would be to support it on top of vfs.
> > > >
> > > > I see vfs as the pure connectivity layer allowing beam to split and parallel process data, and not as a complete replacement.
> > > >
> > > > > Is it the right direction for the VFS project to support the above changes? (Things that are important to a data parallel processing system aren't always important for a filesystem implementation.)
> > > >
> > > > Bulk will be. Distributed processing will stay in beam; parallel processing is important for vfs IMHO since it targets plain batch (like jbatch) as well.
> > > >
> > > > > For example, Apache Beam relies on bulk match, bulk delete, and bulk rename to be able to do things within FileIO efficiently (Dataflow has had a bunch of experience where renaming one file at a time, even when using multiple threads, is quite slow when you have a million files to rename, so having bulk APIs is important). It has a registration mechanism and ties into PipelineOptions pretty well. In my opinion the largest deficiency I see with FileSystems is that we should have used URIs[1] instead of abstract resource types, since we could standardize how URIs are resolved and how globs work for them, allowing FileSystem authors to implement even less. The tricky part is how a URI maps onto the file system correctly.
> > > >
> > > > Sounds like something unrelated to vfs, right? That said, it is not too late to use a fallback mechanism when parsing the path?
> > > >
> > > > > 1: https://issues.apache.org/jira/browse/BEAM-2283
> > > > >
> > > > > On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > >
> > > > > > On 5 March 2018 at 22:26, "Robert Bradshaw" <rober...@google.com> wrote:
> > > > > >
> > > > > > > First, let's try to make the terminology abundantly clear, as I for one have (I think) misinterpreted what has been proposed.
> > > > > > >
> > > > > > > VfsFileSystem: A subclass of https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
> > > > > > >
> > > > > > > VfsIO: A replacement for https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java written using Vfs instead of https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
> > > > > >
> > > > > > Ack
> > > > > >
> > > > > > > Between these two options, VfsFileSystem is the way to go. It will allow us to use all our existing File sources/sinks (including all the fancy watching/streaming support from FileIO) with any filesystem supported by Vfs. Long-term, if VFS is good enough (and we'll be able to do direct experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could consider moving to VFS entirely and even removing the layer of indirection.
> > > > > > > Vfs is a filesystem; this is the right level of abstraction to plug into. Even if it's lacking in some respects, it may still be worth keeping in parallel to the existing FileSystem implementations long-term if it has significantly better coverage.
> > > > > >
> > > > > > Ok
> > > > > >
> > > > > > > On the other hand, a re-implementation of FileIO on top of Vfs seems like a lot of duplication of code (and ongoing maintenance cost) and will be difficult to build on top of (e.g. the binding of TextIO to FileIO is not dynamic like the binding of filesystems).
> > > > > >
> > > > > > Well, it shouldn't. Let me clarify my view: we, as the ASF and not just beam, can help both projects grow from that work and become more mature and interoperable with the existing ecosystem (who implements a beam filesystem when providing a new filesystem?). Interestingly, recent Java versions have a filesystem abstraction too, but that one is harder to evolve for our needs. The high level goal is to keep it ecosystem friendly and not create yet another one.
> > > > > >
> > > > > > > On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Not backing vfs by a filesystem sounds saner, so VfsIO is probably the way to go. It would be a competitor to FileIO and hopefully its replacement in the mid/long term.
> > > > > > > >
> > > > > > > > What about doing the opposite: implementing a vfs filesystem for all the fs we support, potentially enriching vfs if needed? Then we can just drop the beam abstraction, from what I read.
> > > > > > > >
> > > > > > > > On 5 March 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
> > > > > > > >
> > > > > > > > > The terminology is confusing here, since the existing FileIO is a PTransform. VfsFilesystem would be a better name.
> > > > > > > > >
> > > > > > > > > On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com> wrote:
> > > > > > > > >
> > > > > > > > > > On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > What about a beam Filesystem impl on top of Vfs as an alternative short-term solution? This would allow Vfs to be used with any IO.
> > > > > > > > > >
> > > > > > > > > > Yes, I think this is the VfsIO that was proposed.
> > > > > > > > > >
> > > > > > > > > > > On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <rober...@google.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <chamik...@google.com>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I assume you mean https://commons.apache.org/proper/commons-vfs/.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm not sure if we considered this when we originally implemented our own file-system abstraction, but based on a quick look it seems like this is Java only.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, java only
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I think having a similar file-system abstraction for various languages is a plus point for Beam. Maybe we should consider a Java file-system implementation for VFS?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can be an option, but when I see the current complexity I'm not sure mixing 2 abstractions would help; maybe just a VfsIO for java users would be good enough - thinking out loud.
> > > > > > > > > > > > >
> > > > > > > > > > > > > What sounds clear to me is that each language will need its own abstraction - which kind of matches your proposal. However, we can still make it smooth and easy on the java side - which will likely stay mainstream for some years still - using vfs as our java impl instead of reimplementing the full abstraction? This way we keep our *API* but we drop the beam *impl* to just reuse VFS.
> > > > > > > > > > > > >
> > > > > > > > > > > > > PS: for gcs, https://github.com/ltouati/vfs-gcs can be a good example of how it can work.
> > > > > > > > > > > >
> > > > > > > > > > > > I think a VfsIO makes a lot of sense in the short term, and will give us the experience needed to decide if we can move solely to VFS (for Java at least) for implementation, and possibly API in a future major release, in the long run.

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
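
The idea raised in the thread, that glob matching and bulk operations can live above a pure connectivity layer, can be sketched roughly as below. This is only an illustration under stated assumptions: `Connector`, `InMemoryConnector`, `match` and `renameAll` are hypothetical names made up for this sketch, not the Beam FileSystem API and not the Commons VFS API. The only real library call used is the JDK's `FileSystem.getPathMatcher` with the `glob:` syntax.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class GlobSketch {

    /** Hypothetical minimal connectivity layer (assumption, not Beam or VFS). */
    interface Connector {
        /** Returns the names of the direct children of a directory. */
        List<String> listNames(String directory);

        /** Renames a single file. */
        void rename(String from, String to);

        /**
         * Naive bulk rename built from single renames. A backend with a
         * native batch call could override this with something faster.
         */
        default void renameAll(Map<String, String> moves) {
            moves.forEach(this::rename);
        }
    }

    /** Glob matching done above the connector, using only listNames(). */
    static List<String> match(Connector fs, String directory, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> result = new ArrayList<>();
        for (String name : fs.listNames(directory)) {
            if (matcher.matches(Paths.get(name))) {
                result.add(name);
            }
        }
        return result;
    }

    /** In-memory connector, present only to make the sketch runnable. */
    static class InMemoryConnector implements Connector {
        private final Set<String> paths = new TreeSet<>();

        InMemoryConnector(String... initial) {
            paths.addAll(Arrays.asList(initial));
        }

        @Override
        public List<String> listNames(String directory) {
            String prefix = directory + "/";
            List<String> names = new ArrayList<>();
            for (String p : paths) {
                // Keep direct children only: prefix matches and no further separator.
                if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                    names.add(p.substring(prefix.length()));
                }
            }
            return names;
        }

        @Override
        public void rename(String from, String to) {
            if (!paths.remove(from)) {
                throw new IllegalArgumentException("No such file: " + from);
            }
            paths.add(to);
        }
    }

    public static void main(String[] args) {
        InMemoryConnector fs =
            new InMemoryConnector("out/part-0.tmp", "out/part-1.tmp", "out/_SUCCESS");

        // Find temporary shards by glob, then finalize them in one bulk call.
        Map<String, String> moves = new LinkedHashMap<>();
        for (String name : match(fs, "out", "part-*.tmp")) {
            moves.put("out/" + name, "out/" + name.replace(".tmp", ".txt"));
        }
        fs.renameAll(moves);

        System.out.println(fs.listNames("out"));
    }
}
```

Running `main` prints `[_SUCCESS, part-0.txt, part-1.txt]`. The design choice shown is the one debated above: the glob logic never touches the backend, while `renameAll` has a slow default that a backend is free to replace, which is where the concern about renaming a million files one at a time would be addressed.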