Part of the point of the current Filesystem class _is_ to handle these
things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
answer is to keep Filesystem but put Vfs under it (and maybe that will
eventually allow us to remove some of the current code).
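
A minimal sketch of that layering, under stated assumptions: `SingleFileStore` is a hypothetical stand-in for the single-file connectivity a VFS-like library would provide, `InMemoryStore` exists only to make the example runnable, and `BulkLayer` is a hypothetical Beam-side wrapper. None of these names come from Beam or Commons VFS. Glob matching uses the JDK's `PathMatcher`, and the bulk delete is a plain loop that a backend with a native batch call could replace.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical single-file layer: list and delete one file at a time. */
interface SingleFileStore {
  List<String> listAll();

  boolean delete(String path);
}

/** In-memory stand-in for a real backend; used only to make the sketch runnable. */
class InMemoryStore implements SingleFileStore {
  private final Set<String> files = new TreeSet<>();

  InMemoryStore(Collection<String> initial) {
    files.addAll(initial);
  }

  @Override
  public List<String> listAll() {
    return new ArrayList<>(files);
  }

  @Override
  public boolean delete(String path) {
    return files.remove(path);
  }
}

/** Hypothetical wrapper that adds glob matching and bulk delete on top of the store. */
class BulkLayer {
  private final SingleFileStore store;

  BulkLayer(SingleFileStore store) {
    this.store = store;
  }

  /** Glob matching done above the store, independent of the backend. */
  List<String> match(String glob) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    List<String> matched = new ArrayList<>();
    for (String path : store.listAll()) {
      if (matcher.matches(Paths.get(path))) {
        matched.add(path);
      }
    }
    return matched;
  }

  /** Fallback bulk delete: one call per file; returns how many were removed. */
  int deleteAll(Collection<String> paths) {
    int deleted = 0;
    for (String path : paths) {
      if (store.delete(path)) {
        deleted++;
      }
    }
    return deleted;
  }
}
```

The per-file loop is exactly the slow path described further down the thread for very large renames, so the sketch only shows where a bulk API would sit, not how to make it fast.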


On Mon, Mar 5, 2018 at 10:02 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> On Mar 6, 2018 at 01:05, "Lukasz Cwik" <lc...@google.com> wrote:
>
> As is, how does VFS improve upon the current FileSystem solution?
>
>
> This is about the ecosystem. I have never seen a Beam filesystem implemented
> outside Beam, but I have seen tons of VFS users and implementations.
>
>
>
> How much work is it before VFS supports the Apache Beam use cases (bulk
> operations, glob support)?
>
>
> I don't expect VFS to handle globs, but I expect Beam to handle them on top
> of VFS. Said otherwise, glob matching is independent of the fs impl.
>
> Bulk is a good thing, but I think it can be handled as well. Best
> would be to support it on top of VFS.
>
> I see VFS as the pure connectivity layer allowing Beam to split and
> parallel process data and not as a complete replacement.
>
>
>
> Is it the right direction for the VFS project to support the above
> changes? (things that are important to a data parallel processing system
> aren't always important for a filesystem implementation.)
>
>
> Bulk will be. Distributed processing will stay in Beam, parallel
> processing is important for vfs IMHO since it targets plain batch (like
> jbatch) as well.
>
>
> For example, Apache Beam relies on bulk match, bulk delete, bulk rename to
> be able to do things within FileIO efficiently (Dataflow has had a bunch
> of experience where renaming one file at a time even when using multiple
> threads is quite slow when you have a million files to rename so having
> bulk APIs is important). It has a registration mechanism and ties into
> PipelineOptions pretty well. In my opinion the largest deficiency I see
> with FileSystems is that we should have used URIs[1] instead of abstract
> resource types since we could standardize how URIs are resolved and how
> globs work for them, allowing FileSystem authors to implement even less.
> The tricky part is how does a URI map onto the file system correctly.
>
>
> Sounds like something unrelated to VFS, right? That said, it is not too late
> to use a fallback mechanism when parsing the path?
>
>
>
> 1: https://issues.apache.org/jira/browse/BEAM-2283
>
> On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> On Mar 5, 2018 at 22:26, "Robert Bradshaw" <rober...@google.com> wrote:
>>
>> First, let's try to make the terminology abundantly clear, as I for one
>> have (I think) misinterpreted what has been proposed.
>>
>> VfsFileSystem: A subclass of
>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>
>> VfsIO: A replacement for
>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>> written using Vfs instead of
>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>>
>>
>>
>> Ack
>>
>>
>> Between these two options, VfsFileSystem is the way to go. It will allow
>> us to use all our existing File sources/sinks (including all the fancy
>> watching/streaming support from FileIO) with any filesystem supported by
>> VFS. Long-term, if VFS is good enough (and we'll be able to do direct
>> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
>> consider moving to VFS entirely and even removing the layer of indirection.
>> Vfs is a filesystem, this is the right level of abstraction to plug into.
>> Even if it's lacking in some respects, it may still be worth keeping in
>> parallel to the existing FileSystem implementations long-term if it has
>> significantly better coverage.
>>
>>
>> Ok
>>
>>
>> On the other hand, a re-implementation of FileIO on top of Vfs seems like
>> a lot of duplication of code (and ongoing maintenance cost) and will be
>> difficult to build on top of (e.g. the binding of TextIO to FileIO is not
>> dynamic like the binding of filesystems).
>>
>>
>> Well, it shouldn't. Let me clarify my view: we - as the ASF and not just Beam -
>> can make both projects grow from that work and become more mature and
>> interoperable with the existing ecosystem (who implements a Beam filesystem
>> when providing a new filesystem?). Interestingly, recent Java versions
>> have a filesystem abstraction too, but that one is harder to evolve
>> for our needs. The high-level goal is to keep it ecosystem-friendly and not
>> create yet another one.
>>
>>
>>
>>
>> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>> Not backing VFS by a filesystem sounds saner, so VfsIO is probably the
>>> way to go. It would be a competitor to FileIO and hopefully its replacement
>>> in the mid/long term.
>>>
>>> What about doing the opposite: implementing a VFS filesystem for all the
>>> filesystems we support, potentially enriching VFS if needed? Then we could
>>> just drop the Beam abstraction, from what I read.
>>>
>>> On Mar 5, 2018 at 20:49, "Reuven Lax" <re...@google.com> wrote:
>>>
>>>> The terminology is confusing here, since the existing FileIO is a
>>>> PTransform. VfsFilesystem would be a better name.
>>>>
>>>>
>>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>>>>> short-term solution? This would allow Vfs to be used with any IO.
>>>>>>
>>>>>
>>>>> Yes, I think this is the VfsIO that was proposed.
>>>>>
>>>>>
>>>>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw <rober...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath <chamik...@google.com
>>>>>>>> >:
>>>>>>>>
>>>>>>>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>>>>>>>
>>>>>>>>> I'm not sure if we considered this when we originally implemented
>>>>>>>>> our own file-system abstraction, but based on a quick look it seems
>>>>>>>>> like this is Java only.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, java only
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think having a similar file-system abstraction for various
>>>>>>>>> languages is a plus point for Beam. Maybe we should consider a Java
>>>>>>>>> file-system implementation for VFS?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Can be an option, but when I see the current complexity I'm not sure
>>>>>>>> mixing 2 abstractions would help; maybe just a VfsIO for Java users
>>>>>>>> would be good enough - thinking out loud.
>>>>>>>>
>>>>>>>> What sounds clear to me is that each language will need its own
>>>>>>>> abstraction - which kind of joins your proposal. However, we can still
>>>>>>>> make it smooth and easy on the Java side - which will likely stay
>>>>>>>> mainstream for some years still - by using VFS as our Java impl instead
>>>>>>>> of reimplementing the full abstraction? This way we keep our *API* but
>>>>>>>> we drop the Beam *impl* to just reuse VFS.
>>>>>>>>
>>>>>>>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good
>>>>>>>> example on how it can work.
>>>>>>>>
>>>>>>>
>>>>>>> I think a VfsIO makes a lot of sense in the short term, and will
>>>>>>> give us the experience needed to decide if we can move solely to VFS
>>>>>>> (for Java at least) for implementation, and possibly API in a future
>>>>>>> major release, in the long run.
>>>>>>>
>>>>>>
>>
>
>
