Yeah, it was supported for FILE_LOADS only, which was the only thing we
ever needed to use :) (streaming inserts are too expensive and unreliable
to be useful)

Avro (GenericRecord) -> Row does use an intermediate object (it builds a
new Row); sadly it doesn't wrap the record the way the SpecificRecord path
does.  I suppose that's something that could also be optimized though.
Also, in most cases we do need custom translation; simply wrapping objects
as-is in a Row generally doesn't work that well.  In particular, Maps and
Enums need some special handling when going to BQ since they don't have
direct representations there.  I was just looking at the Row -> PB code,
and it just throws an exception when it encounters a map, for example.
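
To make the "custom translation" point concrete, here's a rough sketch of
the kind of hand-written mapping I mean (the field names are made up): an
Avro map field becomes a repeated key/value record, since BQ has no native
map type.

  import com.google.api.services.bigquery.model.TableRow;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;
  import org.apache.avro.generic.GenericRecord;

  // Sketch: record has a string field "user_id" and a map field "counts".
  static TableRow toTableRow(GenericRecord record) {
    TableRow row = new TableRow();
    row.set("user_id", record.get("user_id").toString());

    // Avro map -> repeated STRUCT<key, value>, since BQ can't store a map
    // directly.
    @SuppressWarnings("unchecked")
    Map<CharSequence, Long> counts = (Map<CharSequence, Long>) record.get("counts");
    List<TableRow> entries = new ArrayList<>();
    for (Map.Entry<CharSequence, Long> entry : counts.entrySet()) {
      entries.add(
          new TableRow().set("key", entry.getKey().toString()).set("value", entry.getValue()));
    }
    row.set("counts", entries);
    return row;
  }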

I think I agree that standardizing everything around beam Row is probably
the right way to go, although it'll require a good chunk of work on our
part to make it work as well as avro does.  It's unfortunate that each
insert method takes a different input format (json, avro, proto).

I'll think about this a little more and see what I come up with, I was
mainly just curious if this was being worked on already so I didn't
duplicate work.

On Thu, Aug 5, 2021 at 11:54 AM Reuven Lax <[email protected]> wrote:

> Honestly withAvroFormatFunction was something that was hacked in, and has
> never been supported very well :( I believe that it doesn't work with the
> old streaming inserts either.
>
> I think there are a couple of options here:
>     - Beam can infer a schema from Avro (and therefore a beam Row), so you
> can use this to present a Beam Row
>     - If your input type T is something that admits a schema (pojo,
> javabean, avro, proto, autovalue, etc.), then you can annotate it as such
> and the sink should automatically support it (see the sketch below).
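>
> Roughly what I mean by the second option (just a sketch; the class, field,
> and table names are made up):
>
>     import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>     import org.apache.beam.sdk.schemas.JavaFieldSchema;
>     import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
>
>     // Annotate the element type so Beam can infer a schema for it.
>     @DefaultSchema(JavaFieldSchema.class)
>     public class ClickEvent {
>       public String userId;
>       public long timestampMillis;
>     }
>
>     // "events" is a PCollection<ClickEvent>; the sink derives the table
>     // schema and the row conversion from the inferred Beam schema.
>     events.apply(
>         BigQueryIO.<ClickEvent>write()
>             .to("my-project:my_dataset.clicks")
>             .useBeamSchema()
>             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));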
>
> Someone could of course add support directly for avro. I don't think it
> would be that difficult (though it might be tedious); however, I'd prefer
> to formulate things as schemas as much as possible. Note: an intermediate
> "hop through" a beam Row adds little overhead when it's implemented
> right. E.g. for pojos or javabeans, the beam Row wrapper doesn't copy the
> data; rather, we generate bytecode inside the row to directly access the
> data in the original java object. We do the same when wrapping protobufs
> and AutoValue, though I'm not entirely sure about Avro.
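>
> To illustrate the "hop", this is roughly how the generated conversion gets
> used (a sketch; it assumes MyPojo has a schema registered, e.g. via
> @DefaultSchema, and that I'm remembering the SchemaRegistry API correctly):
>
>     import org.apache.beam.sdk.schemas.NoSuchSchemaException;
>     import org.apache.beam.sdk.schemas.SchemaRegistry;
>     import org.apache.beam.sdk.transforms.SerializableFunction;
>     import org.apache.beam.sdk.values.Row;
>
>     static Row pojoToRow(MyPojo pojo) throws NoSuchSchemaException {
>       SchemaRegistry registry = SchemaRegistry.createDefault();
>       // For pojos/javabeans the Row returned here reads fields from the
>       // original object through generated accessors rather than copying.
>       SerializableFunction<MyPojo, Row> toRow = registry.getToRowFunction(MyPojo.class);
>       return toRow.apply(pojo);
>     }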
>
>
> On Thu, Aug 5, 2021 at 6:42 AM Steve Niemitz <[email protected]> wrote:
>
>> We currently have a lot of jobs that write to BQ using avro (via
>> withAvroFormatFunction), and would like to start experimenting with the
>> streaming write API.
>>
>> I see two options for how to do this:
>> - Implement support for avro -> protobuf like TableRow and beam Row do
>> - Add something like withProtoFormatFunction to go from T ->
>> DynamicMessage
>>
>> The second option seems more efficient since it'd avoid an intermediate
>> hop through avro (or TableRow or beam Row), but option 1 would mean we can
>> use our code as-is with the new streaming write API.
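>>
>> To make option 2 concrete, the (hypothetical, doesn't exist today) API I'm
>> picturing would be something like:
>>
>>     import com.google.protobuf.Descriptors.Descriptor;
>>     import com.google.protobuf.DynamicMessage;
>>     import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>
>>     // Hypothetical: hand the sink a T -> DynamicMessage function directly,
>>     // skipping the intermediate hop.  Assume "events" is a
>>     // PCollection<MyEvent> and "descriptor" is the matching proto
>>     // Descriptor, both made up for this sketch.
>>     events.apply(
>>         BigQueryIO.<MyEvent>write()
>>             .to("my-project:my_dataset.events")
>>             .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
>>             .withProtoFormatFunction(  // hypothetical, not a real method
>>                 (MyEvent e) ->
>>                     DynamicMessage.newBuilder(descriptor)
>>                         .setField(descriptor.findFieldByName("user_id"), e.userId)
>>                         .setField(descriptor.findFieldByName("ts_millis"), e.timestampMillis)
>>                         .build()));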
>>
>> Are there any plans for implementing anything like this already in the
>> works?
>>
>> As a related aside, it's becoming INCREDIBLY complicated to figure out
>> which combination of settings is supported for each write method; I usually
>> need to read through the code to figure it out.  Has there been any thought
>> given to improving the UX here?  Maybe a builder pattern where each write
>> method has its own sub-builder?  Something like:
>>
>> BigQueryIO.write().withFileLoads().withAvroFormatFunction()
>> ?
>>
>
