Re: Understanding BigQueryIO.Read performance and options

Eugene Kirpichov Mon, 25 Sep 2017 08:31:01 -0700

Hi,
I actually also made a PR for this myself the other day, but forgot to
update this thread, sorry :-|
https://github.com/apache/beam/pull/3891
I also hit this conversion issue, and it is intended behavior because
TableRow is the JSON representation of the data, and it matches what
BigQuery does when exporting to JSON format: it converts numbers to STRING.
Yes, some tests were incorrect.


On Mon, Sep 25, 2017 at 7:13 AM Steve Niemitz <[email protected]> wrote:

> I've just about got something to send for a PR, I'll push it shortly.
>
> One thing I noticed though while running the BigQueryIO unit tests that I
> found strange.
> It looks like currently the TableRow / Avro / TableRowJsonCoder interaction
> changes BigQuery INTEGERs to strings in the public interface.  eg
> TableRow.get("someInteger") returns a string, not a long.  The strange
> thing is some unit tests do not mimic this behavior.  For
> example testBigQueryTableSourceInitSplit asserts that the long values exist
> in the result set, not the string versions.  On the other
> hand, testValidateReadSetsDefaultProject gets the value as a String and
> converts it to a Long.
>
> It seems like the behavior to convert to strings is intended, and the unit
> tests are incorrect?  Can anyone confirm?
>
> On Thu, Sep 14, 2017 at 5:31 PM, Steve Niemitz <[email protected]>
> wrote:
>
> > Cool, thanks for the shove in the right direction!
> >
> > I'll probably have some more time to spend on this early next week,
> > hopefully I'll have a PR to submit after that. :)
> >
> > On Thu, Sep 14, 2017 at 4:37 PM, Eugene Kirpichov <
> > [email protected]> wrote:
> >
> >> On Thu, Sep 14, 2017 at 1:10 PM Steve Niemitz <[email protected]>
> >> wrote:
> >>
> >> > I spent a little time trying to refactor it yesterday, but ran into a
> >> > couple tricky points:
> >> >
> >> > - The BigQuerySource implementation's non-split version (mentioned
> >> above)
> >> > doesn't read from avro files and this would be pretty difficult to
> >> "fake"
> >> > into GenericRecords.  It sounds like this could just be dropped
> >> completely
> >> > from what was mentioned previously though.
> >> >
> >> Agreed. This would be a very welcome contribution.
> >>
> >>
> >> >
> >> > - A user-provided SerializableFunction<GenericRecord, T> seems like
> >> the a
> >> > good route to providing an extension point (which could be passed
> >> directly
> >> > to the AvroSource used to read the files), but a) goes against the
> >> > PTransform style guide, and b) is tricky to shoehorn into the current
> >> > implementation, due to the existing tight coupling of the
> GenericRecord
> >> ->
> >> > TableRow transform and the rest of BigQuerySource.
> >> >
> >> If we create a version of BigQueryIO that goes through GenericRecord,
> then
> >> providing a SerializableFunction<GenericRecord, T> will be reasonable,
> >> and
> >> not against the PT Style Guide, because GenericRecord is not encodable
> >> unless you know the schema, which in case of reading from BigQuery you
> >> don't.
> >>
> >>
> >> >
> >> > - I feel like the ideal solution would really be to provide a coder
> for
> >> > GenericRecord, and allow anyone to hook up a MapElements to the output
> >> of a
> >> > BigQueryIO that produces them.  However, the fact that GenericRecord
> >> > bundles the avro schema along with the object means every record
> >> serialized
> >> > would need to also include the full schema.  I was playing around with
> >> ways
> >> > to possibly memoize the schema so it doesn't need to be serialized for
> >> each
> >> > record, but I'm not familiar enough with the guarantees a Coder is
> >> provided
> >> > to know if its safe to do.
> >> >
> >> I suggest to not spend time looking in this direction - I think it's
> >> impossible to provide a Coder for GenericRecord with the current Coder
> API
> >> [changing the Coder API may be possible, but it would be a huge
> >> undertaking]. A SerializableFunction is a reasonable way to go.
> >>
> >>
> >> >
> >> > I hope to have something concrete soon as an example, but in the mean
> >> time
> >> > I'm interested to hear other people's thoughts on the above.
> >> >
> >> >
> >> > On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <
> >> > [email protected]> wrote:
> >> >
> >> > > Oh, I see what you mean. Yeah, I agree that having BigQueryIO use
> >> > TableRow
> >> > > as the native format was a suboptimal decision in retrospect, and I
> >> agree
> >> > > that it would be reasonable to provide ability to go through Avro
> >> > > GenericRecord instead. I'm just not sure how to provide it in an
> >> > > API-compatible way - that would be particularly challenging since
> >> > > BigQueryIO is a beast in terms of amount of code and intermediate
> >> > > transforms involved. But if you have ideas, they would be welcome.
> >> > >
> >> > > On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]>
> >> > wrote:
> >> > >
> >> > > > Ah that makes sense wrt splitting, but is indeed confusing!
> Thanks
> >> for
> >> > > the
> >> > > > explanation. :)
> >> > > >
> >> > > > wrt native types and TableRow, I understand your point, but could
> >> also
> >> > > > argue that the raw avro records are just as "native" to the
> BigQuery
> >> > > > connector as the TableRow JSON objects, since both are directly
> >> exposed
> >> > > by
> >> > > > BigQuery.
> >> > > >
> >> > > > Maybe my use-case is more specialized, but I already have a good
> >> amount
> >> > > of
> >> > > > code that I used pre-Beam to process BigQuery avro extract files,
> >> and
> >> > > avro
> >> > > > is significantly smaller and more performant than JSON, which is
> why
> >> > I'm
> >> > > > using it rather than just using TableRows.
> >> > > >
> >> > > > In any case, if there's no desire for such a feature I can always
> >> > > replicate
> >> > > > the functionality of BigQueryIO in my own codebase, so it's not a
> >> big
> >> > > deal,
> >> > > > it just seems like a feature that would be useful for other people
> >> as
> >> > > well.
> >> > > >
> >> > > > On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax
> <[email protected]
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <
> >> > > > > [email protected]> wrote:
> >> > > > >
> >> > > > > > This is a bit confusing - BigQueryQuerySource and
> >> > BigQueryTableSource
> >> > > > > > indeed use the REST API to read rows if you read them unsplit
> -
> >> > > > however,
> >> > > > > in
> >> > > > > > split() they run extract jobs and produce a bunch of Avro
> >> sources
> >> > > that
> >> > > > > are
> >> > > > > > read in parallel. I'm not sure we have any use cases for
> reading
> >> > them
> >> > > > > > unsplit (except unit tests) - perhaps that code path can be
> >> > removed?
> >> > > > > >
> >> > > > >
> >> > > > > I believe split() will always be called in production. Maybe not
> >> in
> >> > > unit
> >> > > > > tests?
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > About outputting non-TableRow: per
> >> > > > > > https://beam.apache.org/contribute/ptransform-style-
> >> > > > > > guide/#choosing-types-of-input-and-output-pcollections,
> >> > > > > > it is recommended to output the native type of the connector,
> >> > unless
> >> > > > it's
> >> > > > > > impossible to provide a coder for it. This is the case for
> >> > > > > > AvroIO.parseGenericRecords, but it's not the case for
> TableRow,
> >> so
> >> > I
> >> > > > > would
> >> > > > > > recommend against it: you can always map a TableRow to
> something
> >> > else
> >> > > > > using
> >> > > > > > MapElements.
> >> > > > > >
> >> > > > > > On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax
> >> > <[email protected]
> >> > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi Steve,
> >> > > > > > >
> >> > > > > > > The BigQuery source should always uses extract jobs,
> >> regardless
> >> > of
> >> > > > > > > withTemplateCompatibility. What makes you think otherwise?
> >> > > > > > >
> >> > > > > > > Reuven
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <
> >> > [email protected]
> >> > > >
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hello!
> >> > > > > > > >
> >> > > > > > > > Until now I've been using a custom-built alternative to
> >> > > > > BigQueryIO.Read
> >> > > > > > > > that manually runs a BigQuery extract job (to avro), then
> >> uses
> >> > > > > > > > AvroIO.parseGenericRecords() to read the output.
> >> > > > > > > >
> >> > > > > > > > I'm investigating instead enhancing the actual
> >> BigQueryIO.Read
> >> > to
> >> > > > > allow
> >> > > > > > > > something similar, since it appears a good amount of the
> >> > plumbing
> >> > > > is
> >> > > > > > > > already in place to do this.  However I'm confused at some
> >> of
> >> > the
> >> > > > > > > > implementation details.
> >> > > > > > > >
> >> > > > > > > > To start, it seems like there's two different read paths:
> >> > > > > > > > - If "withTemplateCompatibility" is set, a similar method
> to
> >> > > what I
> >> > > > > > > > described above is used; an extract job is started to
> >> export to
> >> > > > avro,
> >> > > > > > and
> >> > > > > > > > AvroSource is used to read files and transform them into
> >> > > TableRows.
> >> > > > > > > >
> >> > > > > > > > - However, if not set, the BigQueryReader class simply
> uses
> >> the
> >> > > > REST
> >> > > > > > API
> >> > > > > > > to
> >> > > > > > > > read rows from the tables.  This method, I've seen in
> >> practice,
> >> > > has
> >> > > > > > some
> >> > > > > > > > significant performance limitations.
> >> > > > > > > >
> >> > > > > > > > It seems to me that for large tables, I'd always want to
> use
> >> > the
> >> > > > > first
> >> > > > > > > > method, however I'm not sure why the implementation is
> tied
> >> to
> >> > > the
> >> > > > > > oddly
> >> > > > > > > > named "withTemplateCompatibility" option.  Does anyone
> have
> >> > > insight
> >> > > > > as
> >> > > > > > to
> >> > > > > > > > the implementation details here?
> >> > > > > > > >
> >> > > > > > > > Additionally, would the community in general be accepting
> to
> >> > > > > > enhancements
> >> > > > > > > > to BigQueryIO to allow the final output to be something
> >> other
> >> > > than
> >> > > > > > > > "TableRow" instances, similar to how
> >> AvroIO.parseGenericRecords
> >> > > > > takes a
> >> > > > > > > > parseFn?
> >> > > > > > > >
> >> > > > > > > > Thanks!
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Understanding BigQueryIO.Read performance and options

Reply via email to