So is the latter is intended for splittable DoFn but not yet using it? The promise of SDF is precisely this composability, isn't it?
Kenn On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> wrote: > Reuven - Is TextIO.read().from() a more complex case than the topic Ismaël > is bringing up in this thread? I'm surprised to hear that the two examples > have different performance characteristics. > > Reading through the implementation, I guess the fundamental difference is > whether a given configuration expands to TextIO.ReadAll or to io.Read. > AFAICT, that detail and the subsequent performance impact is not > documented. > > If the above is correct, perhaps it's an argument for IOs to provide > higher-level methods in cases where they can optimize performance compared > to what a user might naively put together. > > On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> wrote: > >> Jeff, what you did here is not simply a refactoring. These two are quite >> different, and will likely have different performance characteristics. >> >> The first evaluates the wildcard, and allows the runner to pick >> appropriate bundling. Bundles might contain multiple files (if they are >> small), and the runner can split the files as appropriate. In the case of >> the Dataflow runner, these bundles can be further split dynamically. >> >> The second chops of the files inside the the PTransform, and processes >> each chunk in a ParDo. TextIO.readFiles currently chops up each file into >> 64mb chunks (hardcoded), and then processes each chunk in a ParDo. >> >> Reuven >> >> >> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> wrote: >> >>> I would prefer we move towards option [2]. I just tried the following >>> refactor in my own code from: >>> >>> return input >>> .apply(TextIO.read().from(fileSpec)); >>> >>> to: >>> >>> return input >>> .apply(FileIO.match().filepattern(fileSpec)) >>> .apply(FileIO.readMatches()) >>> .apply(TextIO.readFiles()); >>> >>> Yes, the latter is more verbose but not ridiculously so, and it's also >>> more instructive about what's happening. >>> >>> When I first started working with Beam, it took me a while to realize >>> that TextIO.read().from() would accept a wildcard. The more verbose version >>> involves a method called "filepattern" which makes this much more obvious. >>> It also leads me to understand that I could use the same FileIO.match() >>> machinery to do other things with filesystems other than read file >>> contents. >>> >>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com> wrote: >>> >>>> Hello, >>>> >>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a >>>> `readAll()` implementation that basically matches a `PCollection` of >>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is >>>> implemented by a expand function that matches files with FileIO and >>>> then reads them using a format specific `ReadFiles` transform e.g. >>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the >>>> Java implementation is just an user friendly API to hide FileIO.match >>>> + ReadFiles. >>>> >>>> Most recent IOs do NOT implement ReadAll to encourage the more >>>> composable approach of File + ReadFiles, e.g. XmlIO and ParquetIO. >>>> >>>> Implementing ReadAll as a wrapper is relatively easy and is definitely >>>> user friendly, but it has an issue, it may be error-prone and it adds >>>> more code to maintain (mostly ‘repeated’ code). However `readAll` is a >>>> more abstract pattern that applies not only to File based IOs so it >>>> makes sense for example in other transforms that map a `Pcollection` >>>> of read requests and is the basis for SDF composable style APIs like >>>> the recent `HBaseIO.readAll()`. >>>> >>>> So the question is should we: >>>> >>>> [1] Implement `readAll` in all file based IOs to be user friendly and >>>> assume the (minor) maintenance cost >>>> >>>> or >>>> >>>> [2] Deprecate `readAll` from file based IOs and encourage users to use >>>> FileIO + `readFiles` (less maintenance and encourage composition). >>>> >>>> I just checked quickly in the python code base but I did not find if >>>> the File match + ReadFiles pattern applies, but it would be nice to >>>> see what the python guys think on this too. >>>> >>>> This discussion comes from a recent slack conversation with Łukasz >>>> Gajowy, and we wanted to settle into one approach to make the IO >>>> signatures consistent, so any opinions/preferences? >>>> >>>