Re: [EXT] Re: using avro instead of json for BigQueryIO.Write
This is great. I'll take a look today.

On Wed, Feb 26, 2020 at 9:42 AM Chuck Yang wrote:
> Hi Devs,
>
> I was able to get around to working on Avro file loads to BigQuery in
> the Python SDK and now have a PR available at
> https://github.com/apache/beam/pull/10979 . Comments appreciated :)
>
> Thanks,
> Chuck
Re: [EXT] Re: using avro instead of json for BigQueryIO.Write
Hi Devs,

I was able to get around to working on Avro file loads to BigQuery in the Python SDK and now have a PR available at https://github.com/apache/beam/pull/10979 . Comments appreciated :)

Thanks,
Chuck

On Wed, Nov 27, 2019 at 10:10 AM Chuck Yang wrote:
> I would love to fix this, but not sure if I have the bandwidth at the
> moment. Anyway, created the jira here:
> https://jira.apache.org/jira/browse/BEAM-8841
>
> Thanks!
> Chuck

--
*Confidentiality Note:* We care about protecting our proprietary information, confidential material, and trade secrets. This message may contain some or all of those things. Cruise will suffer material harm if anyone other than the intended recipient disseminates or takes any action based on this message. If you have received this message (including any attachments) in error, please delete it immediately and notify the sender promptly.
Re: [EXT] Re: using avro instead of json for BigQueryIO.Write
I would love to fix this, but not sure if I have the bandwidth at the moment. Anyway, created the jira here: https://jira.apache.org/jira/browse/BEAM-8841

Thanks!
Chuck
Re: [EXT] Re: using avro instead of json for BigQueryIO.Write
I don't believe so, please create one (we can dedup if we happen to find another issue). Even better if you can contribute a fix :)

Thanks,
Cham

On Tue, Nov 26, 2019 at 7:07 PM Chuck Yang wrote:
> Has anyone looked into implementing this for the Python SDK? It would
> be nice to have it if only for the ability to write float values with
> NaN and infinity values. I didn't see anything in Jira, happy to
> create a ticket, but wanted to ask around first.
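[The NaN/infinity limitation Chuck mentions is inherent to JSON itself: the JSON spec has no literal for non-finite floats, so a newline-delimited JSON load file cannot represent them faithfully. A minimal stdlib-only sketch of the problem, with a made-up row for illustration:]

```python
import json

row = {"score": float("nan"), "bound": float("inf")}

# Spec-compliant JSON rejects non-finite floats outright.
try:
    json.dumps(row, allow_nan=False)
    strict_error = ""
except ValueError as e:
    strict_error = str(e)

# Python's default (allow_nan=True) emits the NaN/Infinity tokens,
# which are not valid JSON and are generally rejected by strict
# consumers of JSON load files. Avro's double type has no such gap.
lenient = json.dumps(row)

print(strict_error)
print(lenient)  # {"score": NaN, "bound": Infinity}
```

[Either way the value is unloadable as standard JSON, which is why a binary format like Avro sidesteps the issue entirely.]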
Re: using avro instead of json for BigQueryIO.Write
I'll take a look as well. Thanks for doing this!

On Fri, Oct 4, 2019 at 9:16 PM Pablo Estrada wrote:
> Thanks Steve!
> I'll take a look next week. Sorry about the delay so far.
> Best
> -P.
Re: using avro instead of json for BigQueryIO.Write
Thanks Steve!
I'll take a look next week. Sorry about the delay so far.
Best
-P.

On Fri, Sep 27, 2019 at 10:37 AM Steve Niemitz wrote:
> I put up a semi-WIP pull request https://github.com/apache/beam/pull/9665
> for this. The initial results look good. I'll spend some time soon adding
> unit tests and documentation, but I'd appreciate it if someone could take
> a first pass over it.
Re: using avro instead of json for BigQueryIO.Write
I put up a semi-WIP pull request https://github.com/apache/beam/pull/9665 for this. The initial results look good. I'll spend some time soon adding unit tests and documentation, but I'd appreciate it if someone could take a first pass over it.

On Wed, Sep 18, 2019 at 6:14 PM Pablo Estrada wrote:
> Thanks for offering to work on this! It would be awesome to have it. I
> can say that we don't have that for Python ATM.
Re: using avro instead of json for BigQueryIO.Write
Thanks for offering to work on this! It would be awesome to have it. I can say that we don't have that for Python ATM.

On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz wrote:
> Our experience has actually been that avro is more efficient than even
> parquet, but that might also be skewed by our datasets.
>
> I might try to take a crack at this, I found
> https://issues.apache.org/jira/browse/BEAM-2879 tracking it (which
> coincidentally references my thread from a couple years ago on the read
> side of this :) ).
Re: using avro instead of json for BigQueryIO.Write
Our experience has actually been that avro is more efficient than even parquet, but that might also be skewed by our datasets.

I might try to take a crack at this, I found https://issues.apache.org/jira/browse/BEAM-2879 tracking it (which coincidentally references my thread from a couple years ago on the read side of this :) ).

On Mon, Sep 16, 2019 at 1:38 PM Reuven Lax wrote:
> It's been talked about, but nobody's done anything. There are some
> difficulties related to type conversion (json and avro don't support the
> same types), but if those are overcome then an avro version would be much
> more efficient. I believe Parquet files would be even more efficient if
> you wanted to go that path, but there might be more code to write (as we
> already have some code in the codebase to convert between TableRows and
> Avro).
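[The size gap behind these efficiency claims is easy to demonstrate with the stdlib alone. The sketch below uses struct-packed doubles as a rough stand-in for Avro's binary row encoding (real Avro adds a schema header and varint field encoding, so exact numbers will differ), with made-up field names:]

```python
import json
import struct

# 1000 illustrative rows of three float fields each.
rows = [
    {"lat": 37.7749 + i, "lon": -122.4194 - i, "speed": 12.5 * i}
    for i in range(1000)
]

# Newline-delimited JSON: field names and decimal text repeated per row.
json_bytes = ("\n".join(json.dumps(r) for r in rows)).encode()

# Binary stand-in: three 8-byte doubles per row, field names carried
# once in a schema (as Avro does) rather than once per row.
bin_bytes = b"".join(
    struct.pack("<3d", r["lat"], r["lon"], r["speed"]) for r in rows
)

print(len(json_bytes), len(bin_bytes))  # JSON is a multiple of the binary size
```

[On numeric-heavy data like this the JSON file is roughly twice the binary size before compression; the gap widens further once per-row parse cost on the load side is counted.]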
Re: using avro instead of json for BigQueryIO.Write
It's been talked about, but nobody's done anything. There are some difficulties related to type conversion (json and avro don't support the same types), but if those are overcome then an avro version would be much more efficient. I believe Parquet files would be even more efficient if you wanted to go that path, but there might be more code to write (as we already have some code in the codebase to convert between TableRows and Avro).

Reuven

On Mon, Sep 16, 2019 at 10:33 AM Steve Niemitz wrote:
> Has anyone investigated using avro rather than json to load data into
> BigQuery using BigQueryIO (+ FILE_LOADS)?
>
> I'd be interested in enhancing it to support this, but I'm curious if
> there's any prior work here.
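[One concrete case of the type mismatch Reuven describes: JSON has no bytes type, so binary values headed for a BYTES column must be base64-encoded into strings in a JSON load file (roughly 33% larger), while Avro carries bytes natively. A stdlib-only illustration with a made-up payload:]

```python
import base64
import json

payload = bytes(range(16))  # a raw binary value destined for a BYTES column

# Raw bytes are not JSON-serializable at all...
raw_fails = False
try:
    json.dumps({"blob": payload})
except TypeError:
    raw_fails = True

# ...so a JSON load file must base64-encode them, at a ~4/3 size cost.
encoded = base64.b64encode(payload).decode("ascii")
json_row = json.dumps({"blob": encoded})

print(raw_fails, len(payload), len(encoded))  # True 16 24
```

[An Avro-based writer avoids both the encoding step and the size overhead; similar mismatches exist for NaN/infinity floats and for timestamp precision.]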
using avro instead of json for BigQueryIO.Write
Has anyone investigated using avro rather than json to load data into BigQuery using BigQueryIO (+ FILE_LOADS)? I'd be interested in enhancing it to support this, but I'm curious if there's any prior work here.