Yes, this is precisely the goal of SDF.

On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com> wrote:
>
> So is the latter is intended for splittable DoFn but not yet using it? The 
> promise of SDF is precisely this composability, isn't it?
>
> Kenn
>
> On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>> Reuven - Is TextIO.read().from() a more complex case than the topic Ismaël 
>> is bringing up in this thread? I'm surprised to hear that the two examples 
>> have different performance characteristics.
>>
>> Reading through the implementation, I guess the fundamental difference is 
>> whether a given configuration expands to TextIO.ReadAll or to io.Read. 
>> AFAICT, that detail and the subsequent performance impact is not documented.
>>
>> If the above is correct, perhaps it's an argument for IOs to provide 
>> higher-level methods in cases where they can optimize performance compared 
>> to what a user might naively put together.
>>
>> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com> wrote:
>>>
>>> Jeff, what you did here is not simply a refactoring. These two are quite 
>>> different, and will likely have different performance characteristics.
>>>
>>> The first evaluates the wildcard, and allows the runner to pick appropriate 
>>> bundling. Bundles might contain multiple files (if they are small), and the 
>>> runner can split the files as appropriate. In the case of the Dataflow 
>>> runner, these bundles can be further split dynamically.
>>>
>>> The second chops of the files inside the the PTransform, and processes each 
>>> chunk in a ParDo. TextIO.readFiles currently chops up each file into 64mb 
>>> chunks (hardcoded), and then processes each chunk in a ParDo.
>>>
>>> Reuven
>>>
>>>
>>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>>>
>>>> I would prefer we move towards option [2]. I just tried the following 
>>>> refactor in my own code from:
>>>>
>>>>       return input
>>>>           .apply(TextIO.read().from(fileSpec));
>>>>
>>>> to:
>>>>
>>>>       return input
>>>>           .apply(FileIO.match().filepattern(fileSpec))
>>>>           .apply(FileIO.readMatches())
>>>>           .apply(TextIO.readFiles());
>>>>
>>>> Yes, the latter is more verbose but not ridiculously so, and it's also 
>>>> more instructive about what's happening.
>>>>
>>>> When I first started working with Beam, it took me a while to realize that 
>>>> TextIO.read().from() would accept a wildcard. The more verbose version 
>>>> involves a method called "filepattern" which makes this much more obvious. 
>>>> It also leads me to understand that I could use the same FileIO.match() 
>>>> machinery to do other things with filesystems other than read file 
>>>> contents.
>>>>
>>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> A ‘recent’ pattern of use in Beam is to have in file based IOs a
>>>>> `readAll()` implementation that basically matches a `PCollection` of
>>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
>>>>> implemented by a expand function that matches files with FileIO and
>>>>> then reads them using a format specific `ReadFiles` transform e.g.
>>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
>>>>> Java implementation is just an user friendly API to hide FileIO.match
>>>>> + ReadFiles.
>>>>>
>>>>> Most recent IOs do NOT implement ReadAll to encourage the more
>>>>> composable approach of File + ReadFiles, e.g. XmlIO and ParquetIO.
>>>>>
>>>>> Implementing ReadAll as a wrapper is relatively easy and is definitely
>>>>> user friendly, but it has an  issue, it may be error-prone and it adds
>>>>> more code to maintain (mostly ‘repeated’ code). However `readAll` is a
>>>>> more abstract pattern that applies not only to File based IOs so it
>>>>> makes sense for example in other transforms that map a `Pcollection`
>>>>> of read requests and is the basis for SDF composable style APIs like
>>>>> the recent `HBaseIO.readAll()`.
>>>>>
>>>>> So the question is should we:
>>>>>
>>>>> [1] Implement `readAll` in all file based IOs to be user friendly and
>>>>> assume the (minor) maintenance cost
>>>>>
>>>>> or
>>>>>
>>>>> [2] Deprecate `readAll` from file based IOs and encourage users to use
>>>>> FileIO + `readFiles` (less maintenance and encourage composition).
>>>>>
>>>>> I just checked quickly in the python code base but I did not find if
>>>>> the File match + ReadFiles pattern applies, but it would be nice to
>>>>> see what the python guys think on this too.
>>>>>
>>>>> This discussion comes from a recent slack conversation with Łukasz
>>>>> Gajowy, and we wanted to settle into one approach to make the IO
>>>>> signatures consistent, so any opinions/preferences?

Reply via email to