Re: [DISCUSS] Should File based IOs implement readAll() or just readFiles()

Ismaël Mejía Wed, 30 Jan 2019 14:33:24 -0800

Ups slight typo, in the first line of the previous email I meant read
instead of readAll


On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
> Reuven is right for the example, readAll at this moment may be faster
> and also supports Dynamic Work Rebalancing (DWR), but the performance
> of the other approach may (and must) be improved to be equal, once the
> internal implementation of TextIO.read moves to a SDF version instead
> of the FileBasedSource one, and once that runners support DWR through
> SDF. Of course all of this is future work. Probably Eugene can
> eventually chime in to give more details in practical performance in
> his tests in Dataflow.
>
> Really interesting topic, but I want to bring back the discussion to
> the subject of the thread. I think there is some confusion after
> Jeff's example which should have been:
>
>       return input
>           .apply(TextIO.readAll());
>
> to:
>
>       return input
>           .apply(FileIO.match().filepattern(fileSpec))
>           .apply(FileIO.readMatches())
>           .apply(TextIO.readFiles());
>
> This is the question we are addressing, do we need a readAll transform
> that replaces the 3 steps or no?
>
> On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw <rober...@google.com> wrote:
> >
> > Yes, this is precisely the goal of SDF.
> >
> >
> > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com> wrote:
> > >
> > > So is the latter is intended for splittable DoFn but not yet using it? 
> > > The promise of SDF is precisely this composability, isn't it?
> > >
> > > Kenn
> > >
> > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
> > >>
> > >> Reuven - Is TextIO.read().from() a more complex case than the topic 
> > >> Ismaël is bringing up in this thread? I'm surprised to hear that the two 
> > >> examples have different performance characteristics.
> > >>
> > >> Reading through the implementation, I guess the fundamental difference 
> > >> is whether a given configuration expands to TextIO.ReadAll or to 
> > >> io.Read. AFAICT, that detail and the subsequent performance impact is 
> > >> not documented.
> > >>
> > >> If the above is correct, perhaps it's an argument for IOs to provide 
> > >> higher-level methods in cases where they can optimize performance 
> > >> compared to what a user might naively put together.
> > >>
> > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> wrote:
> > >>>
> > >>> Jeff, what you did here is not simply a refactoring. These two are 
> > >>> quite different, and will likely have different performance 
> > >>> characteristics.
> > >>>
> > >>> The first evaluates the wildcard, and allows the runner to pick 
> > >>> appropriate bundling. Bundles might contain multiple files (if they are 
> > >>> small), and the runner can split the files as appropriate. In the case 
> > >>> of the Dataflow runner, these bundles can be further split dynamically.
> > >>>
> > >>> The second chops of the files inside the the PTransform, and processes 
> > >>> each chunk in a ParDo. TextIO.readFiles currently chops up each file 
> > >>> into 64mb chunks (hardcoded), and then processes each chunk in a ParDo.
> > >>>
> > >>> Reuven
> > >>>
> > >>>
> > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> wrote:
> > >>>>
> > >>>> I would prefer we move towards option [2]. I just tried the following 
> > >>>> refactor in my own code from:
> > >>>>
> > >>>>       return input
> > >>>>           .apply(TextIO.read().from(fileSpec));
> > >>>>
> > >>>> to:
> > >>>>
> > >>>>       return input
> > >>>>           .apply(FileIO.match().filepattern(fileSpec))
> > >>>>           .apply(FileIO.readMatches())
> > >>>>           .apply(TextIO.readFiles());
> > >>>>
> > >>>> Yes, the latter is more verbose but not ridiculously so, and it's also 
> > >>>> more instructive about what's happening.
> > >>>>
> > >>>> When I first started working with Beam, it took me a while to realize 
> > >>>> that TextIO.read().from() would accept a wildcard. The more verbose 
> > >>>> version involves a method called "filepattern" which makes this much 
> > >>>> more obvious. It also leads me to understand that I could use the same 
> > >>>> FileIO.match() machinery to do other things with filesystems other 
> > >>>> than read file contents.
> > >>>>
> > >>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com> 
> > >>>> wrote:
> > >>>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a
> > >>>>> `readAll()` implementation that basically matches a `PCollection` of
> > >>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
> > >>>>> implemented by a expand function that matches files with FileIO and
> > >>>>> then reads them using a format specific `ReadFiles` transform e.g.
> > >>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
> > >>>>> Java implementation is just an user friendly API to hide FileIO.match
> > >>>>> + ReadFiles.
> > >>>>>
> > >>>>> Most recent IOs do NOT implement ReadAll to encourage the more
> > >>>>> composable approach of File + ReadFiles, e.g. XmlIO and ParquetIO.
> > >>>>>
> > >>>>> Implementing ReadAll as a wrapper is relatively easy and is definitely
> > >>>>> user friendly, but it has an  issue, it may be error-prone and it adds
> > >>>>> more code to maintain (mostly ‘repeated’ code). However `readAll` is a
> > >>>>> more abstract pattern that applies not only to File based IOs so it
> > >>>>> makes sense for example in other transforms that map a `Pcollection`
> > >>>>> of read requests and is the basis for SDF composable style APIs like
> > >>>>> the recent `HBaseIO.readAll()`.
> > >>>>>
> > >>>>> So the question is should we:
> > >>>>>
> > >>>>> [1] Implement `readAll` in all file based IOs to be user friendly and
> > >>>>> assume the (minor) maintenance cost
> > >>>>>
> > >>>>> or
> > >>>>>
> > >>>>> [2] Deprecate `readAll` from file based IOs and encourage users to use
> > >>>>> FileIO + `readFiles` (less maintenance and encourage composition).
> > >>>>>
> > >>>>> I just checked quickly in the python code base but I did not find if
> > >>>>> the File match + ReadFiles pattern applies, but it would be nice to
> > >>>>> see what the python guys think on this too.
> > >>>>>
> > >>>>> This discussion comes from a recent slack conversation with Łukasz
> > >>>>> Gajowy, and we wanted to settle into one approach to make the IO
> > >>>>> signatures consistent, so any opinions/preferences?

Re: [DISCUSS] Should File based IOs implement readAll() or just readFiles()

Reply via email to