On Wed, Jan 30, 2019 at 2:37 PM Chamikara Jayalath <chamik...@google.com> wrote:
> > > On Wed, Jan 30, 2019 at 2:33 PM Ismaël Mejía <ieme...@gmail.com> wrote: > >> Ups slight typo, in the first line of the previous email I meant read >> instead of readAll >> >> On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía <ieme...@gmail.com> wrote: >> > >> > Reuven is right for the example, readAll at this moment may be faster >> > and also supports Dynamic Work Rebalancing (DWR), but the performance >> > of the other approach may (and must) be improved to be equal, once the >> > internal implementation of TextIO.read moves to a SDF version instead >> > of the FileBasedSource one, and once that runners support DWR through >> > SDF. Of course all of this is future work. Probably Eugene can >> > eventually chime in to give more details in practical performance in >> > his tests in Dataflow. >> > >> > Really interesting topic, but I want to bring back the discussion to >> > the subject of the thread. I think there is some confusion after >> > Jeff's example which should have been: >> > >> > return input >> > .apply(TextIO.readAll()); >> > >> > to: >> > >> > return input >> > .apply(FileIO.match().filepattern(fileSpec)) >> > .apply(FileIO.readMatches()) >> > .apply(TextIO.readFiles()); >> > >> > This is the question we are addressing, do we need a readAll transform >> > that replaces the 3 steps or no? >> > > Ismaël, I'm not quite sure how these two are equal. readFiles() transform > returns a PCollection of ReadableFile objects. Users are expected to read > these files in a subsequent ParDo and produce a PCollection of proper type. > FooIO.ReadAll() transforms on the other hand are tailored to each IO > connector and return a PCollection of objects of type that are supported to > be returned by that IO connector. > I assume you meant FileIO.readFiles() here. Or did you mean TextIO.readFiles() ? If so that seems very similar to TextIO.readAll(). > > > >> > >> > On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw <rober...@google.com> >> wrote: >> > > >> > > Yes, this is precisely the goal of SDF. >> > > >> > > >> > > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com> >> wrote: >> > > > >> > > > So is the latter is intended for splittable DoFn but not yet using >> it? The promise of SDF is precisely this composability, isn't it? >> > > > >> > > > Kenn >> > > > >> > > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> >> wrote: >> > > >> >> > > >> Reuven - Is TextIO.read().from() a more complex case than the >> topic Ismaël is bringing up in this thread? I'm surprised to hear that the >> two examples have different performance characteristics. >> > > >> >> > > >> Reading through the implementation, I guess the fundamental >> difference is whether a given configuration expands to TextIO.ReadAll or to >> io.Read. AFAICT, that detail and the subsequent performance impact is not >> documented. >> > > >> >> > > >> If the above is correct, perhaps it's an argument for IOs to >> provide higher-level methods in cases where they can optimize performance >> compared to what a user might naively put together. >> > > >> >> > > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> >> wrote: >> > > >>> >> > > >>> Jeff, what you did here is not simply a refactoring. These two >> are quite different, and will likely have different performance >> characteristics. >> > > >>> >> > > >>> The first evaluates the wildcard, and allows the runner to pick >> appropriate bundling. Bundles might contain multiple files (if they are >> small), and the runner can split the files as appropriate. In the case of >> the Dataflow runner, these bundles can be further split dynamically. >> > > >>> >> > > >>> The second chops of the files inside the the PTransform, and >> processes each chunk in a ParDo. TextIO.readFiles currently chops up each >> file into 64mb chunks (hardcoded), and then processes each chunk in a ParDo. >> > > >>> >> > > >>> Reuven >> > > >>> >> > > >>> >> > > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> >> wrote: >> > > >>>> >> > > >>>> I would prefer we move towards option [2]. I just tried the >> following refactor in my own code from: >> > > >>>> >> > > >>>> return input >> > > >>>> .apply(TextIO.read().from(fileSpec)); >> > > >>>> >> > > >>>> to: >> > > >>>> >> > > >>>> return input >> > > >>>> .apply(FileIO.match().filepattern(fileSpec)) >> > > >>>> .apply(FileIO.readMatches()) >> > > >>>> .apply(TextIO.readFiles()); >> > > >>>> >> > > >>>> Yes, the latter is more verbose but not ridiculously so, and >> it's also more instructive about what's happening. >> > > >>>> >> > > >>>> When I first started working with Beam, it took me a while to >> realize that TextIO.read().from() would accept a wildcard. The more verbose >> version involves a method called "filepattern" which makes this much more >> obvious. It also leads me to understand that I could use the same >> FileIO.match() machinery to do other things with filesystems other than >> read file contents. >> > > >>>> >> > > >>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com> >> wrote: >> > > >>>>> >> > > >>>>> Hello, >> > > >>>>> >> > > >>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a >> > > >>>>> `readAll()` implementation that basically matches a >> `PCollection` of >> > > >>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. >> `ReadAll` is >> > > >>>>> implemented by a expand function that matches files with FileIO >> and >> > > >>>>> then reads them using a format specific `ReadFiles` transform >> e.g. >> > > >>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in >> the >> > > >>>>> Java implementation is just an user friendly API to hide >> FileIO.match >> > > >>>>> + ReadFiles. >> > > >>>>> >> > > >>>>> Most recent IOs do NOT implement ReadAll to encourage the more >> > > >>>>> composable approach of File + ReadFiles, e.g. XmlIO and >> ParquetIO. >> > > >>>>> >> > > >>>>> Implementing ReadAll as a wrapper is relatively easy and is >> definitely >> > > >>>>> user friendly, but it has an issue, it may be error-prone and >> it adds >> > > >>>>> more code to maintain (mostly ‘repeated’ code). However >> `readAll` is a >> > > >>>>> more abstract pattern that applies not only to File based IOs >> so it >> > > >>>>> makes sense for example in other transforms that map a >> `Pcollection` >> > > >>>>> of read requests and is the basis for SDF composable style APIs >> like >> > > >>>>> the recent `HBaseIO.readAll()`. >> > > >>>>> >> > > >>>>> So the question is should we: >> > > >>>>> >> > > >>>>> [1] Implement `readAll` in all file based IOs to be user >> friendly and >> > > >>>>> assume the (minor) maintenance cost >> > > >>>>> >> > > >>>>> or >> > > >>>>> >> > > >>>>> [2] Deprecate `readAll` from file based IOs and encourage users >> to use >> > > >>>>> FileIO + `readFiles` (less maintenance and encourage >> composition). >> > > >>>>> >> > > >>>>> I just checked quickly in the python code base but I did not >> find if >> > > >>>>> the File match + ReadFiles pattern applies, but it would be >> nice to >> > > >>>>> see what the python guys think on this too. >> > > >>>>> >> > > >>>>> This discussion comes from a recent slack conversation with >> Łukasz >> > > >>>>> Gajowy, and we wanted to settle into one approach to make the IO >> > > >>>>> signatures consistent, so any opinions/preferences? >> >