Yes, this is precisely the goal of SDF.
On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com> wrote:
>
> So is the latter intended for splittable DoFn but not yet using it? The
> promise of SDF is precisely this composability, isn't it?
>
> Kenn
>
> On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>> Reuven - Is TextIO.read().from() a more complex case than the topic Ismaël
>> is bringing up in this thread? I'm surprised to hear that the two examples
>> have different performance characteristics.
>>
>> Reading through the implementation, I guess the fundamental difference is
>> whether a given configuration expands to TextIO.ReadAll or to io.Read.
>> AFAICT, that detail and the subsequent performance impact are not documented.
>>
>> If the above is correct, perhaps it's an argument for IOs to provide
>> higher-level methods in cases where they can optimize performance compared
>> to what a user might naively put together.
>>
>> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> wrote:
>>>
>>> Jeff, what you did here is not simply a refactoring. These two are quite
>>> different, and will likely have different performance characteristics.
>>>
>>> The first evaluates the wildcard, and allows the runner to pick appropriate
>>> bundling. Bundles might contain multiple files (if they are small), and the
>>> runner can split the files as appropriate. In the case of the Dataflow
>>> runner, these bundles can be further split dynamically.
>>>
>>> The second chops up the files inside the PTransform, and processes each
>>> chunk in a ParDo. TextIO.readFiles currently chops up each file into 64 MB
>>> chunks (hardcoded), and then processes each chunk in a ParDo.
>>>
>>> Reuven
>>>
>>>
>>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>>>
>>>> I would prefer we move towards option [2]. I just tried the following
>>>> refactor in my own code from:
>>>>
>>>>     return input
>>>>         .apply(TextIO.read().from(fileSpec));
>>>>
>>>> to:
>>>>
>>>>     return input
>>>>         .apply(FileIO.match().filepattern(fileSpec))
>>>>         .apply(FileIO.readMatches())
>>>>         .apply(TextIO.readFiles());
>>>>
>>>> Yes, the latter is more verbose but not ridiculously so, and it's also
>>>> more instructive about what's happening.
>>>>
>>>> When I first started working with Beam, it took me a while to realize that
>>>> TextIO.read().from() would accept a wildcard. The more verbose version
>>>> involves a method called "filepattern" which makes this much more obvious.
>>>> It also leads me to understand that I could use the same FileIO.match()
>>>> machinery to do other things with filesystems other than read file
>>>> contents.
>>>>
>>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a
>>>>> `readAll()` implementation that basically matches a `PCollection` of
>>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
>>>>> implemented by an expand function that matches files with FileIO and
>>>>> then reads them using a format specific `ReadFiles` transform e.g.
>>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
>>>>> Java implementation is just a user-friendly API to hide FileIO.match
>>>>> + ReadFiles.
>>>>>
>>>>> Most recent IOs do NOT implement ReadAll to encourage the more
>>>>> composable approach of FileIO + ReadFiles, e.g. XmlIO and ParquetIO.
>>>>>
>>>>> Implementing ReadAll as a wrapper is relatively easy and is definitely
>>>>> user friendly, but it has an issue: it may be error-prone and it adds
>>>>> more code to maintain (mostly ‘repeated’ code). However `readAll` is a
>>>>> more abstract pattern that applies not only to File based IOs so it
>>>>> makes sense for example in other transforms that map a `PCollection`
>>>>> of read requests and is the basis for SDF composable style APIs like
>>>>> the recent `HBaseIO.readAll()`.
>>>>>
>>>>> So the question is should we:
>>>>>
>>>>> [1] Implement `readAll` in all file based IOs to be user friendly and
>>>>> assume the (minor) maintenance cost
>>>>>
>>>>> or
>>>>>
>>>>> [2] Deprecate `readAll` from file based IOs and encourage users to use
>>>>> FileIO + `readFiles` (less maintenance and encourage composition).
>>>>>
>>>>> I just checked quickly in the Python code base but I did not find if
>>>>> the File match + ReadFiles pattern applies, but it would be nice to
>>>>> see what the Python guys think on this too.
>>>>>
>>>>> This discussion comes from a recent Slack conversation with Łukasz
>>>>> Gajowy, and we wanted to settle on one approach to make the IO
>>>>> signatures consistent, so any opinions/preferences?
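
A side note for anyone reading this thread later: the fixed-size chunking that Reuven describes for TextIO.readFiles (each file cut into 64 MB pieces, each piece handled by a ParDo) can be pictured with the plain-Java sketch below. This is not Beam's implementation and uses no Beam classes. The names `ChunkSketch`, `OffsetRange`, `split` and `CHUNK_BYTES` are made up for illustration, and the sketch assumes 64 MB means 64 * 1024 * 1024 bytes. It only computes byte ranges. A real text reader also has to deal with lines that straddle a chunk boundary, which is ignored here.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {
    // Hypothetical constant standing in for the hardcoded 64 MB mentioned in the thread.
    static final long CHUNK_BYTES = 64L * 1024 * 1024;

    /** Half-open byte range [start, end) within a single file. */
    static final class OffsetRange {
        final long start;
        final long end;

        OffsetRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Cuts a file of fileSize bytes into consecutive ranges of at most chunkBytes each. */
    static List<OffsetRange> split(long fileSize, long chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<OffsetRange> ranges = new ArrayList<>();
        for (long start = 0; start < fileSize; start += chunkBytes) {
            // The last range is clipped to the end of the file.
            ranges.add(new OffsetRange(start, Math.min(start + chunkBytes, fileSize)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        long fileSize = 150L * 1024 * 1024; // an example 150 MiB file
        for (OffsetRange range : split(fileSize, CHUNK_BYTES)) {
            System.out.println(range);
        }
    }
}
```

With the 150 MiB example the file is cut into three ranges, the last one shorter than the others. The point of contrast with TextIO.read().from() is that here the cut points are fixed up front by the transform, whereas in the first form the runner decides how to bundle and split.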