+1 for the discussion and tracking.

Regards
JB

On 03/06/2018 12:07 PM, Romain Manni-Bucau wrote:
> Created https://issues.apache.org/jira/browse/BEAM-3786 to track the
> discussion (without putting too many details in the ticket for now).
> 
> 
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
> LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 2018-03-06 11:27 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com
> <mailto:rmannibu...@gmail.com>>:
> 
>     @Reuven: this was what I had in mind yes.
> 
> 
> 
>     2018-03-06 11:24 GMT+01:00 Reuven Lax <re...@google.com
>     <mailto:re...@google.com>>:
> 
>         Part of the point of the current Filesystem class _is_ to handle these
>         things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
>         answer is to keep Filesystem but put Vfs under it (and maybe that will
>         eventually allow us to remove some of the current code).
> 
> 
>         On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau
>         <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> 
> 
> 
>             On 6 March 2018 at 01:05, "Lukasz Cwik" <lc...@google.com
>             <mailto:lc...@google.com>> wrote:
> 
>                 As is, how does VFS improve upon the current FileSystem
>                 solution?
> 
> 
>             This is about the ecosystem. I have never seen a Beam filesystem
>             implemented outside Beam, but I have seen tons of VFS users and
>             implementations.
> 
> 
> 
>                 How much work is it before VFS supports the Apache Beam
>                 use cases (bulk operations, glob support)?
> 
> 
>             I don't expect VFS to handle globs, but I expect Beam to handle
>             them on top of VFS. In other words, glob matching is independent
>             of the filesystem implementation.
> 
>             Bulk is a good thing, but I think it can be handled as well.
>             Best would be to support it on top of VFS.
> 
>             I see VFS as the pure connectivity layer allowing Beam to split
>             and process data in parallel, not as a complete replacement.
> 
> 
> 
>                 Is it the right direction for the VFS project to support the
>                 above changes? (things that are important to a data parallel
>                 processing system aren't always important for a filesystem
>                 implementation.)
> 
> 
>             Bulk will be. Distributed processing will stay in Beam; parallel
>             processing is important for VFS IMHO, since it targets plain batch
>             (like JBatch) as well.
> 
> 
>                 For example, Apache Beam relies on bulk match, bulk delete,
>                 and bulk rename to be able to do things within FileIO
>                 efficiently (Dataflow has had a bunch of experience where
>                 renaming one file at a time, even when using multiple threads,
>                 is quite slow when you have a million files to rename, so
>                 having bulk APIs is important). The current FileSystem
>                 abstraction also has a registration mechanism and ties into
>                 PipelineOptions pretty well. In my opinion the largest
>                 deficiency I see with FileSystems is that we should have used
>                 URIs[1] instead of abstract resource types, since we could
>                 standardize how URIs are resolved and how globs work for them,
>                 allowing FileSystem authors to implement even less. The tricky
>                 part is how a URI maps onto the file system correctly.
> 
> 
>             Sounds like something unrelated to VFS, right? That said, surely
>             it is not too late to use a fallback mechanism when parsing the
>             path?
> 
> 
> 
>                 1: https://issues.apache.org/jira/browse/BEAM-2283
>                 <https://issues.apache.org/jira/browse/BEAM-2283>
> 
>                 On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau
>                 <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> 
> 
> 
>                     On 5 March 2018 at 22:26, "Robert Bradshaw"
>                     <rober...@google.com <mailto:rober...@google.com>> wrote:
> 
>                         First, let's try to make the terminology abundantly
>                         clear, as I for one have (I think) misinterpreted what
>                         has been proposed. 
> 
>                         VfsFileSystem: A subclass of
>                         https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
> 
>                         VfsIO: A replacement for
>                         https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>                         written using Vfs instead of
>                         https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
> 
> 
> 
>                     Ack
> 
> 
>                         Between these two options, VfsFileSystem is the way to
>                         go. It will allow us to use all our existing File
>                         sources/sinks (including all the fancy
>                         watching/streaming support from FileIO) with any
>                         filesystem supported by Vfs. Long-term, if VFS is good
>                         enough (and we'll be able to do direct experiments
>                         of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we
>                         could consider moving to VFS entirely and even
>                         removing the layer of indirection. Vfs is a filesystem;
>                         this is the right level of abstraction to plug into.
>                         Even if it's lacking in some respects, it may still be
>                         worth keeping in parallel to the existing FileSystem
>                         implementations long-term if it has significantly
>                         better coverage.
> 
> 
>                     Ok
> 
> 
>                         On the other hand, a re-implementation of FileIO on
>                         top of Vfs seems like a lot of duplication of code (and
>                         ongoing maintenance cost) and will be difficult to
>                         build on top of (e.g. the binding of TextIO to FileIO
>                         is not dynamic like the binding of filesystems).
> 
> 
>                     Well, it shouldn't. Let me clarify my view: we, as the ASF
>                     and not just Beam, can make both projects grow from that
>                     work and become more mature and interoperable with the
>                     existing ecosystem (who implements a Beam filesystem when
>                     providing a new filesystem?). Interestingly, recent Java
>                     versions have a filesystem abstraction too, but that one
>                     is harder to evolve for our needs. The high-level goal is
>                     to keep it ecosystem friendly and not create yet another
>                     one.
> 
> 
> 
> 
>                         On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau
>                         <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>
>                         wrote:
> 
>                             Not backing VFS by a filesystem sounds saner, so
>                             VfsIO is probably the way to go. It would be a
>                             competitor to FileIO and hopefully its replacement
>                             in the mid/long term.
> 
>                             What about doing the opposite: implementing a VFS
>                             filesystem for all the filesystems we support,
>                             potentially enriching VFS if needed? Then we could
>                             just drop the Beam abstraction, from what I read.
> 
>                             On 5 March 2018 at 20:49, "Reuven Lax"
>                             <re...@google.com <mailto:re...@google.com>> wrote:
> 
>                                 The terminology is confusing here, since the
>                                 existing FileIO is a PTransform. VfsFilesystem
>                                 would be a better name.
> 
> 
>                                 On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw
>                                 <rober...@google.com
>                                 <mailto:rober...@google.com>> wrote:
> 
>                                     On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax
>                                     <re...@google.com
>                                     <mailto:re...@google.com>> wrote:
> 
>                                         What about a Beam Filesystem impl on
>                                         top of Vfs as an alternative short-term
>                                         solution? This would allow Vfs to be
>                                         used with any IO.
> 
> 
>                                     Yes, I think this is the VfsIO that was
>                                     proposed. 
>                                      
> 
>                                         On Mon, Mar 5, 2018 at 11:37 AM Robert
>                                         Bradshaw <rober...@google.com
>                                         <mailto:rober...@google.com>> wrote:
> 
> 
>                                             On Mon, Mar 5, 2018 at 11:23 AM
>                                             Romain Manni-Bucau
>                                             <rmannibu...@gmail.com
>                                             <mailto:rmannibu...@gmail.com>>
>                                             wrote:
> 
> 
>                                                 2018-03-05 20:04 GMT+01:00
>                                                 Chamikara Jayalath
>                                                 <chamik...@google.com
>                                                 <mailto:chamik...@google.com>>:
> 
>                                                     I assume you mean
>                                                     https://commons.apache.org/proper/commons-vfs/.
> 
>                                                     I'm not sure if we
>                                                     considered this when we
>                                                     originally implemented our
>                                                     own file-system
>                                                     abstraction, but based on
>                                                     a quick look it seems like
>                                                     this is Java only.
> 
> 
>                                                 Yes, Java only.
>                                                  
> 
> 
>                                                     I think having a similar
>                                                     file-system abstraction
>                                                     for various languages is a
>                                                     plus point for Beam. Maybe
>                                                     we should consider a Java
>                                                     file-system implementation
>                                                     for VFS?
> 
> 
>                                                 That can be an option, but
>                                                 when I see the current
>                                                 complexity I'm not sure mixing
>                                                 two abstractions would help.
>                                                 Maybe just a VfsIO for Java
>                                                 users would be good enough -
>                                                 thinking out loud.
> 
>                                                 What sounds clear to me is
>                                                 that each language will need
>                                                 its own abstraction - which
>                                                 kind of matches your proposal.
>                                                 However, can we still make it
>                                                 smooth and easy on the Java
>                                                 side - which will likely stay
>                                                 mainstream for some years
>                                                 still - by using VFS as our
>                                                 Java impl instead of
>                                                 reimplementing the full
>                                                 abstraction? This way we keep
>                                                 our *API* but we drop the Beam
>                                                 *impl* to just reuse VFS.
> 
>                                                 PS: for GCS,
>                                                 https://github.com/ltouati/vfs-gcs
>                                                 can be a good example of how
>                                                 it can work.
> 
> 
>                                             I think a VfsIO makes a lot of
>                                             sense in the short term, and will
>                                             give us the experience needed to
>                                             decide if we can move solely to
>                                             VFS (for Java at least) for
>                                             implementation, and possibly API
>                                             in a future major release, in the
>                                             long run.
> 
> 
> 
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
