I would prefer we move towards option [2]. I just tried the following
refactor in my own code from:

      return input
          .apply(TextIO.read().from(fileSpec));

to:

      return input
          .apply(FileIO.match().filepattern(fileSpec))
          .apply(FileIO.readMatches())
          .apply(TextIO.readFiles());

Yes, the latter is more verbose but not ridiculously so, and it's also more
instructive about what's happening.

When I first started working with Beam, it took me a while to realize that
TextIO.read().from() would accept a wildcard. The more verbose version
involves a method called "filepattern" which makes this much more obvious.
It also made me realize that I could use the same FileIO.match()
machinery to do things with filesystems other than reading file
contents.
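
A rough way to picture that staging is the plain-Java analogy below. It is
only a sketch and uses java.nio rather than Beam: the class MatchThenRead
and its methods match, readLines and totalBytes are hypothetical names made
up for illustration, and java.nio glob rules are not guaranteed to be
identical to Beam filepattern rules. The point is just that once matching
is a separate step, its output can feed something other than a reader.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java analogy only; nothing in this class is Beam API.
public class MatchThenRead {

    // Roughly analogous to the "match a filepattern" stage:
    // expand a glob into the list of files it matches.
    static List<Path> match(Path dir, String glob) throws IOException {
        List<Path> matched = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                matched.add(p);
            }
        }
        // Directory iteration order is unspecified, so sort for stable output.
        Collections.sort(matched);
        return matched;
    }

    // Roughly analogous to the "read the matched files" stages:
    // turn matched files into lines of text.
    static List<String> readLines(List<Path> files) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path f : files) {
            lines.addAll(Files.readAllLines(f)); // UTF-8 by default
        }
        return lines;
    }

    // A use of the match stage that never reads file contents:
    // sum the sizes of the matched files.
    static long totalBytes(List<Path> files) throws IOException {
        long total = 0;
        for (Path f : files) {
            total += Files.size(f);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("match-demo");
        Files.write(dir.resolve("a.txt"), "one\ntwo\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.txt"), "three\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("notes.md"), "ignored\n".getBytes(StandardCharsets.UTF_8));

        List<Path> files = match(dir, "*.txt");
        System.out.println("lines: " + readLines(files));
        System.out.println("bytes: " + totalBytes(files));
    }
}
```

Both readLines and totalBytes consume the same match result, which is the
kind of reuse the explicit three-step pipeline makes visible.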

On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <[email protected]> wrote:

> Hello,
>
> A ‘recent’ usage pattern in Beam is for file-based IOs to provide a
> `readAll()` implementation that matches a `PCollection` of file
> patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
> implemented by an expand function that matches files with FileIO and
> then reads them using a format-specific `ReadFiles` transform, e.g.
> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
> Java implementation is just a user-friendly API to hide FileIO.match
> + ReadFiles.
>
> Most recent IOs do NOT implement ReadAll, to encourage the more
> composable approach of FileIO + ReadFiles, e.g. XmlIO and ParquetIO.
>
> Implementing ReadAll as a wrapper is relatively easy and is definitely
> user-friendly, but it has drawbacks: it may be error-prone and it adds
> more code to maintain (mostly ‘repeated’ code). However, `readAll` is a
> more abstract pattern that applies not only to file-based IOs, so it
> also makes sense in other transforms that map a `PCollection` of read
> requests, and it is the basis for SDF composable-style APIs like the
> recent `HBaseIO.readAll()`.
>
> So the question is should we:
>
> [1] Implement `readAll` in all file based IOs to be user friendly and
> assume the (minor) maintenance cost
>
> or
>
> [2] Deprecate `readAll` from file-based IOs and encourage users to use
> FileIO + `readFiles` (less maintenance, and it encourages composition).
>
> I quickly checked the Python codebase but could not tell whether the
> FileIO match + ReadFiles pattern applies there. It would be nice to
> see what the Python guys think on this too.
>
> This discussion comes from a recent Slack conversation with Łukasz
> Gajowy, and we wanted to settle on one approach to make the IO
> signatures consistent, so any opinions/preferences?
>
