I want to chime in that I am also +1 to deprecating readAll, is there anyone strongly pro readAll instead of the explicit composition? And more important, can the Python/Go SDK align with this (deprecating ReadAll and implementing ReadFiles)?
On Thu, Jan 31, 2019 at 12:34 AM Chamikara Jayalath <chamik...@google.com> wrote: > > Thanks for the clarification Ismaël and Eugene. +1 for deprecating existing > FooIO.readAll() transforms in favor of FooIO.readFiles(). > > On Wed, Jan 30, 2019 at 3:25 PM Eugene Kirpichov <kirpic...@google.com> wrote: >> >> TextIO.read() and AvroIO.read() indeed perform better than match() + >> readMatches() + readFiles(), due to DWR - so for these two in particular I >> would not recommend such a refactoring. >> However, new file-based IOs that do not support DWR should only provide >> readFiles(). Those that do, should provide read() and readFiles(). When SDF >> supports DWR, then readFiles() will be enough in all cases. >> In general there's no need for readAll() for new file-based IOs - it is >> always equivalent to matchAll() + readMatches() + readFiles() including >> performance-wise. It was included in TextIO/AvroIO before readFiles() was a >> thing. >> >> On Wed, Jan 30, 2019 at 2:41 PM Chamikara Jayalath <chamik...@google.com> >> wrote: >>> >>> On Wed, Jan 30, 2019 at 2:37 PM Chamikara Jayalath <chamik...@google.com> >>> wrote: >>>> >>>> >>>> >>>> On Wed, Jan 30, 2019 at 2:33 PM Ismaël Mejía <ieme...@gmail.com> wrote: >>>>> >>>>> Ups slight typo, in the first line of the previous email I meant read >>>>> instead of readAll >>>>> >>>>> On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía <ieme...@gmail.com> wrote: >>>>> > >>>>> > Reuven is right for the example, readAll at this moment may be faster >>>>> > and also supports Dynamic Work Rebalancing (DWR), but the performance >>>>> > of the other approach may (and must) be improved to be equal, once the >>>>> > internal implementation of TextIO.read moves to a SDF version instead >>>>> > of the FileBasedSource one, and once that runners support DWR through >>>>> > SDF. Of course all of this is future work. Probably Eugene can >>>>> > eventually chime in to give more details in practical performance in >>>>> > his tests in Dataflow. >>>>> > >>>>> > Really interesting topic, but I want to bring back the discussion to >>>>> > the subject of the thread. I think there is some confusion after >>>>> > Jeff's example which should have been: >>>>> > >>>>> > return input >>>>> > .apply(TextIO.readAll()); >>>>> > >>>>> > to: >>>>> > >>>>> > return input >>>>> > .apply(FileIO.match().filepattern(fileSpec)) >>>>> > .apply(FileIO.readMatches()) >>>>> > .apply(TextIO.readFiles()); >>>>> > >>>>> > This is the question we are addressing, do we need a readAll transform >>>>> > that replaces the 3 steps or no? >>>> >>>> >>>> Ismaël, I'm not quite sure how these two are equal. readFiles() transform >>>> returns a PCollection of ReadableFile objects. Users are expected to read >>>> these files in a subsequent ParDo and produce a PCollection of proper >>>> type. FooIO.ReadAll() transforms on the other hand are tailored to each IO >>>> connector and return a PCollection of objects of type that are supported >>>> to be returned by that IO connector. >>> >>> >>> I assume you meant FileIO.readFiles() here. Or did you mean >>> TextIO.readFiles() ? If so that seems very similar to TextIO.readAll(). >>>> >>>> >>>> >>>>> >>>>> > >>>>> > On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw <rober...@google.com> >>>>> > wrote: >>>>> > > >>>>> > > Yes, this is precisely the goal of SDF. >>>>> > > >>>>> > > >>>>> > > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com> >>>>> > > wrote: >>>>> > > > >>>>> > > > So is the latter is intended for splittable DoFn but not yet using >>>>> > > > it? The promise of SDF is precisely this composability, isn't it? >>>>> > > > >>>>> > > > Kenn >>>>> > > > >>>>> > > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> >>>>> > > > wrote: >>>>> > > >> >>>>> > > >> Reuven - Is TextIO.read().from() a more complex case than the >>>>> > > >> topic Ismaël is bringing up in this thread? I'm surprised to hear >>>>> > > >> that the two examples have different performance characteristics. >>>>> > > >> >>>>> > > >> Reading through the implementation, I guess the fundamental >>>>> > > >> difference is whether a given configuration expands to >>>>> > > >> TextIO.ReadAll or to io.Read. AFAICT, that detail and the >>>>> > > >> subsequent performance impact is not documented. >>>>> > > >> >>>>> > > >> If the above is correct, perhaps it's an argument for IOs to >>>>> > > >> provide higher-level methods in cases where they can optimize >>>>> > > >> performance compared to what a user might naively put together. >>>>> > > >> >>>>> > > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> >>>>> > > >> wrote: >>>>> > > >>> >>>>> > > >>> Jeff, what you did here is not simply a refactoring. These two >>>>> > > >>> are quite different, and will likely have different performance >>>>> > > >>> characteristics. >>>>> > > >>> >>>>> > > >>> The first evaluates the wildcard, and allows the runner to pick >>>>> > > >>> appropriate bundling. Bundles might contain multiple files (if >>>>> > > >>> they are small), and the runner can split the files as >>>>> > > >>> appropriate. In the case of the Dataflow runner, these bundles >>>>> > > >>> can be further split dynamically. >>>>> > > >>> >>>>> > > >>> The second chops of the files inside the the PTransform, and >>>>> > > >>> processes each chunk in a ParDo. TextIO.readFiles currently chops >>>>> > > >>> up each file into 64mb chunks (hardcoded), and then processes >>>>> > > >>> each chunk in a ParDo. >>>>> > > >>> >>>>> > > >>> Reuven >>>>> > > >>> >>>>> > > >>> >>>>> > > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> >>>>> > > >>> wrote: >>>>> > > >>>> >>>>> > > >>>> I would prefer we move towards option [2]. I just tried the >>>>> > > >>>> following refactor in my own code from: >>>>> > > >>>> >>>>> > > >>>> return input >>>>> > > >>>> .apply(TextIO.read().from(fileSpec)); >>>>> > > >>>> >>>>> > > >>>> to: >>>>> > > >>>> >>>>> > > >>>> return input >>>>> > > >>>> .apply(FileIO.match().filepattern(fileSpec)) >>>>> > > >>>> .apply(FileIO.readMatches()) >>>>> > > >>>> .apply(TextIO.readFiles()); >>>>> > > >>>> >>>>> > > >>>> Yes, the latter is more verbose but not ridiculously so, and >>>>> > > >>>> it's also more instructive about what's happening. >>>>> > > >>>> >>>>> > > >>>> When I first started working with Beam, it took me a while to >>>>> > > >>>> realize that TextIO.read().from() would accept a wildcard. The >>>>> > > >>>> more verbose version involves a method called "filepattern" >>>>> > > >>>> which makes this much more obvious. It also leads me to >>>>> > > >>>> understand that I could use the same FileIO.match() machinery to >>>>> > > >>>> do other things with filesystems other than read file contents. >>>>> > > >>>> >>>>> > > >>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía >>>>> > > >>>> <ieme...@gmail.com> wrote: >>>>> > > >>>>> >>>>> > > >>>>> Hello, >>>>> > > >>>>> >>>>> > > >>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a >>>>> > > >>>>> `readAll()` implementation that basically matches a >>>>> > > >>>>> `PCollection` of >>>>> > > >>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. >>>>> > > >>>>> `ReadAll` is >>>>> > > >>>>> implemented by a expand function that matches files with FileIO >>>>> > > >>>>> and >>>>> > > >>>>> then reads them using a format specific `ReadFiles` transform >>>>> > > >>>>> e.g. >>>>> > > >>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in >>>>> > > >>>>> the >>>>> > > >>>>> Java implementation is just an user friendly API to hide >>>>> > > >>>>> FileIO.match >>>>> > > >>>>> + ReadFiles. >>>>> > > >>>>> >>>>> > > >>>>> Most recent IOs do NOT implement ReadAll to encourage the more >>>>> > > >>>>> composable approach of File + ReadFiles, e.g. XmlIO and >>>>> > > >>>>> ParquetIO. >>>>> > > >>>>> >>>>> > > >>>>> Implementing ReadAll as a wrapper is relatively easy and is >>>>> > > >>>>> definitely >>>>> > > >>>>> user friendly, but it has an issue, it may be error-prone and >>>>> > > >>>>> it adds >>>>> > > >>>>> more code to maintain (mostly ‘repeated’ code). However >>>>> > > >>>>> `readAll` is a >>>>> > > >>>>> more abstract pattern that applies not only to File based IOs >>>>> > > >>>>> so it >>>>> > > >>>>> makes sense for example in other transforms that map a >>>>> > > >>>>> `Pcollection` >>>>> > > >>>>> of read requests and is the basis for SDF composable style APIs >>>>> > > >>>>> like >>>>> > > >>>>> the recent `HBaseIO.readAll()`. >>>>> > > >>>>> >>>>> > > >>>>> So the question is should we: >>>>> > > >>>>> >>>>> > > >>>>> [1] Implement `readAll` in all file based IOs to be user >>>>> > > >>>>> friendly and >>>>> > > >>>>> assume the (minor) maintenance cost >>>>> > > >>>>> >>>>> > > >>>>> or >>>>> > > >>>>> >>>>> > > >>>>> [2] Deprecate `readAll` from file based IOs and encourage users >>>>> > > >>>>> to use >>>>> > > >>>>> FileIO + `readFiles` (less maintenance and encourage >>>>> > > >>>>> composition). >>>>> > > >>>>> >>>>> > > >>>>> I just checked quickly in the python code base but I did not >>>>> > > >>>>> find if >>>>> > > >>>>> the File match + ReadFiles pattern applies, but it would be >>>>> > > >>>>> nice to >>>>> > > >>>>> see what the python guys think on this too. >>>>> > > >>>>> >>>>> > > >>>>> This discussion comes from a recent slack conversation with >>>>> > > >>>>> Łukasz >>>>> > > >>>>> Gajowy, and we wanted to settle into one approach to make the IO >>>>> > > >>>>> signatures consistent, so any opinions/preferences?