Yeah, it was supported for FILE_LOADS only, which was the only thing we ever needed to use :) (streaming inserts are too expensive and unreliable to be useful)
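For context, the pattern we use everywhere today is roughly this (a simplified sketch; MyEvent, toGenericRecord, and EVENT_TABLE_SCHEMA are made-up placeholders, not our actual code):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// Sketch of our current Avro write path; only the FILE_LOADS method honors
// withAvroFormatFunction today.
static void writeEvents(PCollection<MyEvent> events) {
  events.apply(
      "WriteEventsToBQ",
      BigQueryIO.<MyEvent>write()
          .to("my-project:my_dataset.events")
          .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
          // T -> GenericRecord; this is where our custom map/enum translation lives.
          .withAvroFormatFunction(
              req -> toGenericRecord(req.getElement(), req.getSchema()))
          // Table schema; the Avro schema for the load files is derived from it by default.
          .withSchema(EVENT_TABLE_SCHEMA));
}
```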
Avro (GenericRecord) -> Row does use an intermediate object (it builds a Row); it doesn't wrap the record the way SpecificRecord does, sadly. I suppose that's something that could also be optimized, though. Also, in most cases we do need custom translation; simply wrapping objects as-is in a Row generally doesn't work that well. In particular, Maps and Enums need special handling on the way to BQ since they have no direct representation there. I was just looking at the Row -> PB code, and it simply throws an exception when it encounters a map, for example.

I do agree that standardizing everything around beam Row is probably the right way to go, although it'll take a good chunk of work on our part to make it work as well as avro does. It's unfortunate that each insert method takes a different input format (json, avro, proto).

I'll think about this a little more and see what I come up with; I was mainly curious whether this was already being worked on, so I didn't duplicate work.

On Thu, Aug 5, 2021 at 11:54 AM Reuven Lax <[email protected]> wrote:

> Honestly withAvroFormatFunction was something that was hacked in, and has
> never been supported very well :( I believe that it doesn't work with the
> old streaming inserts either.
>
> I think there are a couple of options here:
> - Beam can infer a schema from Avro (and therefore a beam Row), so you
> can use this to present a Beam Row
> - If your input type T is something that admits a schema (pojo,
> javabean, avro, proto, autovalue, etc.) then you can annotate it as such
> and the sink should automatically support it.
>
> Someone could of course add support directly for avro. I don't think it
> would be that difficult (though it might be tedious), however I'd prefer to
> formulate things as schemas as much as possible. Note: intermediate
> "hopping through" a beam Row adds little overhead when it's implemented
> right. e.g. for pojos or javabeans, the beam Row wrapper doesn't copy the
> data, rather we generate bytecode inside the row to directly access the
> data in the original java object. We do the same when wrapping protobufs
> and AutoValue, though I'm not entirely sure about Avro.
>
>
> On Thu, Aug 5, 2021 at 6:42 AM Steve Niemitz <[email protected]> wrote:
>
>> We currently have a lot of jobs that write to BQ using avro (using
>> withAvroFormatFunction), and would like to start experimenting with the
>> streaming write API.
>>
>> I see two options for how to do this:
>> - Implement support for avro -> protobuf like TableRow and beam Row do
>> - Add something like withProtoFormatFunction to go from T ->
>> DynamicMessage
>>
>> The second option seems more efficient since it'd avoid an intermediate
>> hop through avro (or TableRow or beam Row), but option 1 would mean we
>> could use our code as-is with the new streaming write API.
>>
>> Are there already any plans in the works for implementing something
>> like this?
>>
>> As a related aside, it's becoming INCREDIBLY complicated to figure out
>> which combination of settings is supported for each write method; I
>> usually need to read through the code to figure it out. Has there been
>> any thought given to improving the UX here? Maybe a builder pattern where
>> each write method has its own sub-builder? Something like:
>>
>> BigQueryIO.write().withFileLoads().withAvroFormatFunction()
>> ?
>>
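To make sure I understand the schema route Reuven describes, I think the annotated-type version would look roughly like this (a hedged sketch; MyEvent and its fields are just placeholders I made up):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical element type; the annotation lets Beam infer a schema (and
// therefore a Row view) from the public fields, without copying the data.
@DefaultSchema(JavaFieldSchema.class)
public class MyEvent {
  public String userId;
  public long timestampMillis;
}

// With a schema'd input, the sink can derive the table schema and the row
// conversion itself, so no explicit format function is needed.
static void writeEvents(PCollection<MyEvent> events) {
  events.apply(
      "WriteEventsToBQ",
      BigQueryIO.<MyEvent>write()
          .to("my-project:my_dataset.events")
          .useBeamSchema());
}
```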
