I want to chime in that I am also +1 to deprecating readAll, is there
anyone strongly pro readAll instead of the explicit composition?
And more important, can the Python/Go SDK align with this (deprecating
ReadAll and implementing ReadFiles)?

On Thu, Jan 31, 2019 at 12:34 AM Chamikara Jayalath
<chamik...@google.com> wrote:
>
> Thanks for the clarification Ismaël and Eugene. +1 for deprecating existing 
> FooIO.readAll() transforms in favor of FooIO.readFiles().
>
> On Wed, Jan 30, 2019 at 3:25 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>
>> TextIO.read() and AvroIO.read() indeed perform better than match() + 
>> readMatches() + readFiles(), due to DWR - so for these two in particular I 
>> would not recommend such a refactoring.
>> However, new file-based IOs that do not support DWR should only provide 
>> readFiles(). Those that do, should provide read() and readFiles(). When SDF 
>> supports DWR, then readFiles() will be enough in all cases.
>> In general there's no need for readAll() for new file-based IOs - it is 
>> always equivalent to matchAll() + readMatches() + readFiles() including 
>> performance-wise. It was included in TextIO/AvroIO before readFiles() was a 
>> thing.
>>
>> On Wed, Jan 30, 2019 at 2:41 PM Chamikara Jayalath <chamik...@google.com> 
>> wrote:
>>>
>>> On Wed, Jan 30, 2019 at 2:37 PM Chamikara Jayalath <chamik...@google.com> 
>>> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Jan 30, 2019 at 2:33 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>
>>>>> Ups slight typo, in the first line of the previous email I meant read
>>>>> instead of readAll
>>>>>
>>>>> On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>> >
>>>>> > Reuven is right for the example, readAll at this moment may be faster
>>>>> > and also supports Dynamic Work Rebalancing (DWR), but the performance
>>>>> > of the other approach may (and must) be improved to be equal, once the
>>>>> > internal implementation of TextIO.read moves to a SDF version instead
>>>>> > of the FileBasedSource one, and once that runners support DWR through
>>>>> > SDF. Of course all of this is future work. Probably Eugene can
>>>>> > eventually chime in to give more details in practical performance in
>>>>> > his tests in Dataflow.
>>>>> >
>>>>> > Really interesting topic, but I want to bring back the discussion to
>>>>> > the subject of the thread. I think there is some confusion after
>>>>> > Jeff's example which should have been:
>>>>> >
>>>>> >       return input
>>>>> >           .apply(TextIO.readAll());
>>>>> >
>>>>> > to:
>>>>> >
>>>>> >       return input
>>>>> >           .apply(FileIO.match().filepattern(fileSpec))
>>>>> >           .apply(FileIO.readMatches())
>>>>> >           .apply(TextIO.readFiles());
>>>>> >
>>>>> > This is the question we are addressing, do we need a readAll transform
>>>>> > that replaces the 3 steps or no?
>>>>
>>>>
>>>> Ismaël, I'm not quite sure how these two are equal. readFiles() transform 
>>>> returns a PCollection of ReadableFile objects. Users are expected to read 
>>>> these files in a subsequent ParDo and produce a PCollection of proper 
>>>> type. FooIO.ReadAll() transforms on the other hand are tailored to each IO 
>>>> connector and return a PCollection of objects of type that are supported 
>>>> to be returned by that IO connector.
>>>
>>>
>>> I assume you meant FileIO.readFiles()  here. Or did you mean 
>>> TextIO.readFiles() ? If so that seems very similar to TextIO.readAll().
>>>>
>>>>
>>>>
>>>>>
>>>>> >
>>>>> > On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw <rober...@google.com> 
>>>>> > wrote:
>>>>> > >
>>>>> > > Yes, this is precisely the goal of SDF.
>>>>> > >
>>>>> > >
>>>>> > > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com> 
>>>>> > > wrote:
>>>>> > > >
>>>>> > > > So is the latter is intended for splittable DoFn but not yet using 
>>>>> > > > it? The promise of SDF is precisely this composability, isn't it?
>>>>> > > >
>>>>> > > > Kenn
>>>>> > > >
>>>>> > > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> 
>>>>> > > > wrote:
>>>>> > > >>
>>>>> > > >> Reuven - Is TextIO.read().from() a more complex case than the 
>>>>> > > >> topic Ismaël is bringing up in this thread? I'm surprised to hear 
>>>>> > > >> that the two examples have different performance characteristics.
>>>>> > > >>
>>>>> > > >> Reading through the implementation, I guess the fundamental 
>>>>> > > >> difference is whether a given configuration expands to 
>>>>> > > >> TextIO.ReadAll or to io.Read. AFAICT, that detail and the 
>>>>> > > >> subsequent performance impact is not documented.
>>>>> > > >>
>>>>> > > >> If the above is correct, perhaps it's an argument for IOs to 
>>>>> > > >> provide higher-level methods in cases where they can optimize 
>>>>> > > >> performance compared to what a user might naively put together.
>>>>> > > >>
>>>>> > > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> 
>>>>> > > >> wrote:
>>>>> > > >>>
>>>>> > > >>> Jeff, what you did here is not simply a refactoring. These two 
>>>>> > > >>> are quite different, and will likely have different performance 
>>>>> > > >>> characteristics.
>>>>> > > >>>
>>>>> > > >>> The first evaluates the wildcard, and allows the runner to pick 
>>>>> > > >>> appropriate bundling. Bundles might contain multiple files (if 
>>>>> > > >>> they are small), and the runner can split the files as 
>>>>> > > >>> appropriate. In the case of the Dataflow runner, these bundles 
>>>>> > > >>> can be further split dynamically.
>>>>> > > >>>
>>>>> > > >>> The second chops of the files inside the the PTransform, and 
>>>>> > > >>> processes each chunk in a ParDo. TextIO.readFiles currently chops 
>>>>> > > >>> up each file into 64mb chunks (hardcoded), and then processes 
>>>>> > > >>> each chunk in a ParDo.
>>>>> > > >>>
>>>>> > > >>> Reuven
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> 
>>>>> > > >>> wrote:
>>>>> > > >>>>
>>>>> > > >>>> I would prefer we move towards option [2]. I just tried the 
>>>>> > > >>>> following refactor in my own code from:
>>>>> > > >>>>
>>>>> > > >>>>       return input
>>>>> > > >>>>           .apply(TextIO.read().from(fileSpec));
>>>>> > > >>>>
>>>>> > > >>>> to:
>>>>> > > >>>>
>>>>> > > >>>>       return input
>>>>> > > >>>>           .apply(FileIO.match().filepattern(fileSpec))
>>>>> > > >>>>           .apply(FileIO.readMatches())
>>>>> > > >>>>           .apply(TextIO.readFiles());
>>>>> > > >>>>
>>>>> > > >>>> Yes, the latter is more verbose but not ridiculously so, and 
>>>>> > > >>>> it's also more instructive about what's happening.
>>>>> > > >>>>
>>>>> > > >>>> When I first started working with Beam, it took me a while to 
>>>>> > > >>>> realize that TextIO.read().from() would accept a wildcard. The 
>>>>> > > >>>> more verbose version involves a method called "filepattern" 
>>>>> > > >>>> which makes this much more obvious. It also leads me to 
>>>>> > > >>>> understand that I could use the same FileIO.match() machinery to 
>>>>> > > >>>> do other things with filesystems other than read file contents.
>>>>> > > >>>>
>>>>> > > >>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía 
>>>>> > > >>>> <ieme...@gmail.com> wrote:
>>>>> > > >>>>>
>>>>> > > >>>>> Hello,
>>>>> > > >>>>>
>>>>> > > >>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a
>>>>> > > >>>>> `readAll()` implementation that basically matches a 
>>>>> > > >>>>> `PCollection` of
>>>>> > > >>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. 
>>>>> > > >>>>> `ReadAll` is
>>>>> > > >>>>> implemented by a expand function that matches files with FileIO 
>>>>> > > >>>>> and
>>>>> > > >>>>> then reads them using a format specific `ReadFiles` transform 
>>>>> > > >>>>> e.g.
>>>>> > > >>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in 
>>>>> > > >>>>> the
>>>>> > > >>>>> Java implementation is just an user friendly API to hide 
>>>>> > > >>>>> FileIO.match
>>>>> > > >>>>> + ReadFiles.
>>>>> > > >>>>>
>>>>> > > >>>>> Most recent IOs do NOT implement ReadAll to encourage the more
>>>>> > > >>>>> composable approach of File + ReadFiles, e.g. XmlIO and 
>>>>> > > >>>>> ParquetIO.
>>>>> > > >>>>>
>>>>> > > >>>>> Implementing ReadAll as a wrapper is relatively easy and is 
>>>>> > > >>>>> definitely
>>>>> > > >>>>> user friendly, but it has an  issue, it may be error-prone and 
>>>>> > > >>>>> it adds
>>>>> > > >>>>> more code to maintain (mostly ‘repeated’ code). However 
>>>>> > > >>>>> `readAll` is a
>>>>> > > >>>>> more abstract pattern that applies not only to File based IOs 
>>>>> > > >>>>> so it
>>>>> > > >>>>> makes sense for example in other transforms that map a 
>>>>> > > >>>>> `Pcollection`
>>>>> > > >>>>> of read requests and is the basis for SDF composable style APIs 
>>>>> > > >>>>> like
>>>>> > > >>>>> the recent `HBaseIO.readAll()`.
>>>>> > > >>>>>
>>>>> > > >>>>> So the question is should we:
>>>>> > > >>>>>
>>>>> > > >>>>> [1] Implement `readAll` in all file based IOs to be user 
>>>>> > > >>>>> friendly and
>>>>> > > >>>>> assume the (minor) maintenance cost
>>>>> > > >>>>>
>>>>> > > >>>>> or
>>>>> > > >>>>>
>>>>> > > >>>>> [2] Deprecate `readAll` from file based IOs and encourage users 
>>>>> > > >>>>> to use
>>>>> > > >>>>> FileIO + `readFiles` (less maintenance and encourage 
>>>>> > > >>>>> composition).
>>>>> > > >>>>>
>>>>> > > >>>>> I just checked quickly in the python code base but I did not 
>>>>> > > >>>>> find if
>>>>> > > >>>>> the File match + ReadFiles pattern applies, but it would be 
>>>>> > > >>>>> nice to
>>>>> > > >>>>> see what the python guys think on this too.
>>>>> > > >>>>>
>>>>> > > >>>>> This discussion comes from a recent slack conversation with 
>>>>> > > >>>>> Łukasz
>>>>> > > >>>>> Gajowy, and we wanted to settle into one approach to make the IO
>>>>> > > >>>>> signatures consistent, so any opinions/preferences?

Reply via email to