Ah, that makes sense wrt splitting, but it is indeed confusing! Thanks for the explanation. :)
wrt native types and TableRow, I understand your point, but one could also argue that the raw Avro records are just as "native" to the BigQuery connector as the TableRow JSON objects, since both are directly exposed by BigQuery. Maybe my use case is more specialized, but I already have a good amount of pre-Beam code that processes BigQuery Avro extract files, and Avro is significantly smaller and more performant than JSON, which is why I'm using it rather than TableRows.

In any case, if there's no desire for such a feature I can always replicate the functionality of BigQueryIO in my own codebase, so it's not a big deal; it just seems like a feature that would be useful to other people as well.
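For concreteness, here's roughly what my current path looks like once the extract job has finished (a minimal sketch: the extract-job step itself is elided, and the bucket path, field names, and the MyRecord type are all illustrative stand-ins):

import java.io.Serializable;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

// Illustrative domain type; stands in for whatever the pipeline consumes.
class MyRecord implements Serializable {
  final long id;
  final String name;
  MyRecord(long id, String name) { this.id = id; this.name = name; }
}

public class ReadExtractedAvro {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Assumes a BigQuery extract job has already written Avro files
    // under gs://my-bucket/extract/ (path is illustrative).
    PCollection<MyRecord> records = pipeline.apply(
        AvroIO.parseGenericRecords(
            new SerializableFunction<GenericRecord, MyRecord>() {
              @Override
              public MyRecord apply(GenericRecord r) {
                // Avro strings arrive as Utf8 instances, hence toString().
                return new MyRecord((Long) r.get("id"), r.get("name").toString());
              }
            })
            .from("gs://my-bucket/extract/*.avro")
            // parseGenericRecords cannot always infer a coder for the
            // output type, so supply one explicitly.
            .withCoder(SerializableCoder.of(MyRecord.class)));

    pipeline.run();
  }
}

The point is just that AvroIO.parseGenericRecords already gives the caller a parseFn hook at the Avro layer; the missing piece is having BigQueryIO drive the extract job and expose the same hook.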
On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]> wrote:

> On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:
>
> > This is a bit confusing - BigQueryQuerySource and BigQueryTableSource
> > indeed use the REST API to read rows if you read them unsplit - however,
> > in split() they run extract jobs and produce a bunch of Avro sources that
> > are read in parallel. I'm not sure we have any use cases for reading them
> > unsplit (except unit tests) - perhaps that code path can be removed?
>
> I believe split() will always be called in production. Maybe not in unit
> tests?
>
> > About outputting non-TableRow: per
> > https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections,
> > it is recommended to output the native type of the connector, unless it's
> > impossible to provide a coder for it. This is the case for
> > AvroIO.parseGenericRecords, but it's not the case for TableRow, so I
> > would recommend against it: you can always map a TableRow to something
> > else using MapElements.
> >
> > On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:
> >
> > > Hi Steve,
> > >
> > > The BigQuery source should always use extract jobs, regardless of
> > > withTemplateCompatibility. What makes you think otherwise?
> > >
> > > Reuven
> > >
> > > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]> wrote:
> > >
> > > > Hello!
> > > >
> > > > Until now I've been using a custom-built alternative to
> > > > BigQueryIO.Read that manually runs a BigQuery extract job (to Avro),
> > > > then uses AvroIO.parseGenericRecords() to read the output.
> > > >
> > > > I'm investigating instead enhancing the actual BigQueryIO.Read to
> > > > allow something similar, since it appears a good amount of the
> > > > plumbing is already in place to do this. However, I'm confused by
> > > > some of the implementation details.
> > > >
> > > > To start, it seems like there are two different read paths:
> > > >
> > > > - If "withTemplateCompatibility" is set, a method similar to the one
> > > > I described above is used: an extract job exports the table to Avro,
> > > > and AvroSource reads the files and transforms them into TableRows.
> > > >
> > > > - However, if it is not set, the BigQueryReader class simply uses the
> > > > REST API to read rows from the tables. This method, I've seen in
> > > > practice, has some significant performance limitations.
> > > >
> > > > It seems to me that for large tables I'd always want to use the first
> > > > method, but I'm not sure why the implementation is tied to the oddly
> > > > named "withTemplateCompatibility" option. Does anyone have insight as
> > > > to the implementation details here?
> > > >
> > > > Additionally, would the community in general be accepting of
> > > > enhancements to BigQueryIO to allow the final output to be something
> > > > other than "TableRow" instances, similar to how
> > > > AvroIO.parseGenericRecords takes a parseFn?
> > > >
> > > > Thanks!
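For illustration, the parseFn-style enhancement asked about above might look something like the sketch below. This is hypothetical: BigQueryIO exposes no such method in this thread's timeframe, and BigQueryIO.parseGenericRecords is an invented name modeled on AvroIO; MyRecord and pipeline are carried over from the earlier sketch.

// Hypothetical sketch only: BigQueryIO.parseGenericRecords is an invented
// name, by analogy with AvroIO.parseGenericRecords.
// Assumed extra import: org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.
PCollection<MyRecord> parsed = pipeline.apply(
    BigQueryIO.parseGenericRecords(
        new SerializableFunction<GenericRecord, MyRecord>() {
          @Override
          public MyRecord apply(GenericRecord r) {
            return new MyRecord((Long) r.get("id"), r.get("name").toString());
          }
        })
        .from("project:dataset.table")  // table reference is illustrative
        .withCoder(SerializableCoder.of(MyRecord.class)));

Under such an API the connector would run the extract job and read the resulting Avro files itself, handing each GenericRecord to the caller's parseFn instead of converting it to a TableRow first.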

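For comparison, the MapElements route suggested in the thread, using only existing APIs, would look roughly like this (MyRecord and the field names are again illustrative; assumed extra imports: org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO, com.google.api.services.bigquery.model.TableRow, org.apache.beam.sdk.transforms.MapElements, org.apache.beam.sdk.values.TypeDescriptor):

// Read TableRows with the stock connector, then convert downstream.
PCollection<MyRecord> converted = pipeline
    .apply(BigQueryIO.read().from("project:dataset.table"))
    .apply(MapElements
        .into(TypeDescriptor.of(MyRecord.class))
        .via((TableRow row) -> new MyRecord(
            // BigQuery's JSON representation returns INTEGER values
            // as strings, hence the parse.
            Long.parseLong((String) row.get("id")),
            (String) row.get("name"))));
// MyRecord has no registered coder, so set one explicitly.
converted.setCoder(SerializableCoder.of(MyRecord.class));

This works, but every row still passes through the TableRow JSON representation first, which is exactly the overhead the parseFn proposal is trying to avoid.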