I put up a semi-WIP pull request https://github.com/apache/beam/pull/9665 for this. The initial results look good. I'll spend some time soon adding unit tests and documentation, but I'd appreciate it if someone could take a first pass over it.

On Wed, Sep 18, 2019 at 6:14 PM Pablo Estrada <pabl...@google.com> wrote:

> Thanks for offering to work on this! It would be awesome to have it. I can
> say that we don't have that for Python ATM.
>
> On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz <sniem...@apache.org>
> wrote:
>
>> Our experience has actually been that avro is more efficient than even
>> parquet, but that might also be skewed from our datasets.
>>
>> I might try to take a crack at this, I found
>> https://issues.apache.org/jira/browse/BEAM-2879 tracking it (which
>> coincidentally references my thread from a couple years ago on the read
>> side of this :) ).
>>
>> On Mon, Sep 16, 2019 at 1:38 PM Reuven Lax <re...@google.com> wrote:
>>
>>> It's been talked about, but nobody's done anything. There are some
>>> difficulties related to type conversion (json and avro don't support the
>>> same types), but if those are overcome then an avro version would be much
>>> more efficient. I believe Parquet files would be even more efficient if you
>>> wanted to go that path, but there might be more code to write (as we
>>> already have some code in the codebase to convert between TableRows and
>>> Avro).
>>>
>>> Reuven
>>>
>>> On Mon, Sep 16, 2019 at 10:33 AM Steve Niemitz <sniem...@apache.org>
>>> wrote:
>>>
>>>> Has anyone investigated using avro rather than json to load data into
>>>> BigQuery using BigQueryIO (+ FILE_LOADS)?
>>>>
>>>> I'd be interested in enhancing it to support this, but I'm curious if
>>>> there's any prior work here.