So is the latter is intended for splittable DoFn but not yet using it? The
promise of SDF is precisely this composability, isn't it?

Kenn

On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:

> Reuven - Is TextIO.read().from() a more complex case than the topic Ismaël
> is bringing up in this thread? I'm surprised to hear that the two examples
> have different performance characteristics.
>
> Reading through the implementation, I guess the fundamental difference is
> whether a given configuration expands to TextIO.ReadAll or to io.Read.
> AFAICT, that detail and the subsequent performance impact is not
> documented.
>
> If the above is correct, perhaps it's an argument for IOs to provide
> higher-level methods in cases where they can optimize performance compared
> to what a user might naively put together.
>
> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> wrote:
>
>> Jeff, what you did here is not simply a refactoring. These two are quite
>> different, and will likely have different performance characteristics.
>>
>> The first evaluates the wildcard, and allows the runner to pick
>> appropriate bundling. Bundles might contain multiple files (if they are
>> small), and the runner can split the files as appropriate. In the case of
>> the Dataflow runner, these bundles can be further split dynamically.
>>
>> The second chops of the files inside the the PTransform, and processes
>> each chunk in a ParDo. TextIO.readFiles currently chops up each file into
>> 64mb chunks (hardcoded), and then processes each chunk in a ParDo.
>>
>> Reuven
>>
>>
>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>>> I would prefer we move towards option [2]. I just tried the following
>>> refactor in my own code from:
>>>
>>>       return input
>>>           .apply(TextIO.read().from(fileSpec));
>>>
>>> to:
>>>
>>>       return input
>>>           .apply(FileIO.match().filepattern(fileSpec))
>>>           .apply(FileIO.readMatches())
>>>           .apply(TextIO.readFiles());
>>>
>>> Yes, the latter is more verbose but not ridiculously so, and it's also
>>> more instructive about what's happening.
>>>
>>> When I first started working with Beam, it took me a while to realize
>>> that TextIO.read().from() would accept a wildcard. The more verbose version
>>> involves a method called "filepattern" which makes this much more obvious.
>>> It also leads me to understand that I could use the same FileIO.match()
>>> machinery to do other things with filesystems other than read file
>>> contents.
>>>
>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a
>>>> `readAll()` implementation that basically matches a `PCollection` of
>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
>>>> implemented by a expand function that matches files with FileIO and
>>>> then reads them using a format specific `ReadFiles` transform e.g.
>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
>>>> Java implementation is just an user friendly API to hide FileIO.match
>>>> + ReadFiles.
>>>>
>>>> Most recent IOs do NOT implement ReadAll to encourage the more
>>>> composable approach of File + ReadFiles, e.g. XmlIO and ParquetIO.
>>>>
>>>> Implementing ReadAll as a wrapper is relatively easy and is definitely
>>>> user friendly, but it has an  issue, it may be error-prone and it adds
>>>> more code to maintain (mostly ‘repeated’ code). However `readAll` is a
>>>> more abstract pattern that applies not only to File based IOs so it
>>>> makes sense for example in other transforms that map a `Pcollection`
>>>> of read requests and is the basis for SDF composable style APIs like
>>>> the recent `HBaseIO.readAll()`.
>>>>
>>>> So the question is should we:
>>>>
>>>> [1] Implement `readAll` in all file based IOs to be user friendly and
>>>> assume the (minor) maintenance cost
>>>>
>>>> or
>>>>
>>>> [2] Deprecate `readAll` from file based IOs and encourage users to use
>>>> FileIO + `readFiles` (less maintenance and encourage composition).
>>>>
>>>> I just checked quickly in the python code base but I did not find if
>>>> the File match + ReadFiles pattern applies, but it would be nice to
>>>> see what the python guys think on this too.
>>>>
>>>> This discussion comes from a recent slack conversation with Łukasz
>>>> Gajowy, and we wanted to settle into one approach to make the IO
>>>> signatures consistent, so any opinions/preferences?
>>>>
>>>

Reply via email to