On Wed, Jan 30, 2019 at 2:33 PM Ismaël Mejía <ieme...@gmail.com> wrote:

> Ups slight typo, in the first line of the previous email I meant read
> instead of readAll
>
> On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > Reuven is right for the example, readAll at this moment may be faster
> > and also supports Dynamic Work Rebalancing (DWR), but the performance
> > of the other approach may (and must) be improved to be equal, once the
> > internal implementation of TextIO.read moves to a SDF version instead
> > of the FileBasedSource one, and once that runners support DWR through
> > SDF. Of course all of this is future work. Probably Eugene can
> > eventually chime in to give more details in practical performance in
> > his tests in Dataflow.
> >
> > Really interesting topic, but I want to bring back the discussion to
> > the subject of the thread. I think there is some confusion after
> > Jeff's example which should have been:
> >
> >       return input
> >           .apply(TextIO.readAll());
> >
> > to:
> >
> >       return input
> >           .apply(FileIO.match().filepattern(fileSpec))
> >           .apply(FileIO.readMatches())
> >           .apply(TextIO.readFiles());
> >
> > This is the question we are addressing, do we need a readAll transform
> > that replaces the 3 steps or no?
>

Ismaël, I'm not quite sure how these two are equal. readFiles() transform
returns a PCollection of ReadableFile objects. Users are expected to read
these files in a subsequent ParDo and produce a PCollection of proper type.
FooIO.ReadAll() transforms on the other hand are tailored to each IO
connector and return a PCollection of objects of type that are supported to
be returned by that IO connector.



> >
> > On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw <rober...@google.com>
> wrote:
> > >
> > > Yes, this is precisely the goal of SDF.
> > >
> > >
> > > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com>
> wrote:
> > > >
> > > > So is the latter is intended for splittable DoFn but not yet using
> it? The promise of SDF is precisely this composability, isn't it?
> > > >
> > > > Kenn
> > > >
> > > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com>
> wrote:
> > > >>
> > > >> Reuven - Is TextIO.read().from() a more complex case than the topic
> Ismaël is bringing up in this thread? I'm surprised to hear that the two
> examples have different performance characteristics.
> > > >>
> > > >> Reading through the implementation, I guess the fundamental
> difference is whether a given configuration expands to TextIO.ReadAll or to
> io.Read. AFAICT, that detail and the subsequent performance impact is not
> documented.
> > > >>
> > > >> If the above is correct, perhaps it's an argument for IOs to
> provide higher-level methods in cases where they can optimize performance
> compared to what a user might naively put together.
> > > >>
> > > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com>
> wrote:
> > > >>>
> > > >>> Jeff, what you did here is not simply a refactoring. These two are
> quite different, and will likely have different performance characteristics.
> > > >>>
> > > >>> The first evaluates the wildcard, and allows the runner to pick
> appropriate bundling. Bundles might contain multiple files (if they are
> small), and the runner can split the files as appropriate. In the case of
> the Dataflow runner, these bundles can be further split dynamically.
> > > >>>
> > > >>> The second chops of the files inside the the PTransform, and
> processes each chunk in a ParDo. TextIO.readFiles currently chops up each
> file into 64mb chunks (hardcoded), and then processes each chunk in a ParDo.
> > > >>>
> > > >>> Reuven
> > > >>>
> > > >>>
> > > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com>
> wrote:
> > > >>>>
> > > >>>> I would prefer we move towards option [2]. I just tried the
> following refactor in my own code from:
> > > >>>>
> > > >>>>       return input
> > > >>>>           .apply(TextIO.read().from(fileSpec));
> > > >>>>
> > > >>>> to:
> > > >>>>
> > > >>>>       return input
> > > >>>>           .apply(FileIO.match().filepattern(fileSpec))
> > > >>>>           .apply(FileIO.readMatches())
> > > >>>>           .apply(TextIO.readFiles());
> > > >>>>
> > > >>>> Yes, the latter is more verbose but not ridiculously so, and it's
> also more instructive about what's happening.
> > > >>>>
> > > >>>> When I first started working with Beam, it took me a while to
> realize that TextIO.read().from() would accept a wildcard. The more verbose
> version involves a method called "filepattern" which makes this much more
> obvious. It also leads me to understand that I could use the same
> FileIO.match() machinery to do other things with filesystems other than
> read file contents.
> > > >>>>
> > > >>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com>
> wrote:
> > > >>>>>
> > > >>>>> Hello,
> > > >>>>>
> > > >>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a
> > > >>>>> `readAll()` implementation that basically matches a
> `PCollection` of
> > > >>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll`
> is
> > > >>>>> implemented by a expand function that matches files with FileIO
> and
> > > >>>>> then reads them using a format specific `ReadFiles` transform
> e.g.
> > > >>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in
> the
> > > >>>>> Java implementation is just an user friendly API to hide
> FileIO.match
> > > >>>>> + ReadFiles.
> > > >>>>>
> > > >>>>> Most recent IOs do NOT implement ReadAll to encourage the more
> > > >>>>> composable approach of File + ReadFiles, e.g. XmlIO and
> ParquetIO.
> > > >>>>>
> > > >>>>> Implementing ReadAll as a wrapper is relatively easy and is
> definitely
> > > >>>>> user friendly, but it has an  issue, it may be error-prone and
> it adds
> > > >>>>> more code to maintain (mostly ‘repeated’ code). However
> `readAll` is a
> > > >>>>> more abstract pattern that applies not only to File based IOs so
> it
> > > >>>>> makes sense for example in other transforms that map a
> `Pcollection`
> > > >>>>> of read requests and is the basis for SDF composable style APIs
> like
> > > >>>>> the recent `HBaseIO.readAll()`.
> > > >>>>>
> > > >>>>> So the question is should we:
> > > >>>>>
> > > >>>>> [1] Implement `readAll` in all file based IOs to be user
> friendly and
> > > >>>>> assume the (minor) maintenance cost
> > > >>>>>
> > > >>>>> or
> > > >>>>>
> > > >>>>> [2] Deprecate `readAll` from file based IOs and encourage users
> to use
> > > >>>>> FileIO + `readFiles` (less maintenance and encourage
> composition).
> > > >>>>>
> > > >>>>> I just checked quickly in the python code base but I did not
> find if
> > > >>>>> the File match + ReadFiles pattern applies, but it would be nice
> to
> > > >>>>> see what the python guys think on this too.
> > > >>>>>
> > > >>>>> This discussion comes from a recent slack conversation with
> Łukasz
> > > >>>>> Gajowy, and we wanted to settle into one approach to make the IO
> > > >>>>> signatures consistent, so any opinions/preferences?
>

Reply via email to