By the way, does anyone know what the status of the BigQuery connector is in Beam Go and Beam SQL? Perhaps some folks working on these SDKs can chime in here. I am curious whether these SDKs also make (or will make) it the user's responsibility to base64-encode bytes. As I mentioned above, it is desirable to have a consistent UX across SDKs, especially given that we are working on adding support for cross-language pipelines (https://beam.apache.org/roadmap/connectors-multi-sdk/).
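To make the question concrete, here is a minimal sketch of what "the user base64-encodes bytes" means in practice, using only the Python standard library. The helper names encode_for_bq and decode_from_bq are made up for illustration and are not part of any Beam SDK; the sketch only shows the encoding contract described in the quoted message below, not how any connector implements it.

```python
import base64


def encode_for_bq(raw: bytes) -> str:
    """Hypothetical helper: base64-encode raw bytes into an ASCII str
    before handing the value to a sink that expects base64 input."""
    return base64.b64encode(raw).decode("ascii")


def decode_from_bq(encoded: str) -> bytes:
    """Hypothetical helper: recover raw bytes from a base64 string
    returned on the read path."""
    return base64.b64decode(encoded)


raw = b"\xab\x00abc"
encoded = encode_for_bq(raw)

# The round trip gives back the original bytes.
assert decode_from_bq(encoded) == raw

# If the user skips the encoding step, the sink interprets the input
# itself as base64, which is the b'abc' -> b'i\xb7' effect reported in
# the original message.
assert base64.b64decode("abc=") == b"i\xb7"
```

Whether the SDK or the user performs this step is exactly the UX question; the sketch is the same either way.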
On Wed, May 15, 2019 at 12:26 PM Valentyn Tymofieiev <[email protected]> wrote:

> I took a closer look at the BigQuery IO implementation in the Beam SDK and
> Dataflow runner while reviewing a few PRs to address BEAM-6769, and I think
> we have to revise the course of action here.
>
> It turns out that when we first added support for BYTES in Java BigQuery
> IO, we designed the API with an expectation that:
> - On write path the user must pass base64-encoded bytes to the BQ IO. [0]
> - On read path BQ IO base64-encodes the output result, before serving it
> to the user. [1]
>
> When support for BigQuery was added to Python SDK and Dataflow runner, the
> runner authors preserved the behavior of treating bytes to be consistent
> with Java BQ IO - bytes must be base64-encoded by the user, and bytes from
> BQ IO returned by Dataflow Python runner are base64-encoded.
>
> Unfortunately, this behavior is not documented in public documentation or
> JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up until
> we added integration tests a few years down the road [5,6]. Thanks to these
> integration tests we discovered BEAM-6769.
>
> I don't have context on why we made the decision to avoid handling raw bytes
> in Beam, however I think keeping consistent treatment of bytes across all SDKs
> and runners is important for a smooth user experience, especially so when a
> particular behavior is not documented well.
>
> This being said, I suggest the following:
> 1. Let's keep the current expectation that Beam operates only on
> base64-encoded bytes in BQ IO. It may be reasonable to revise this
> expectation, but it is beyond the scope of BEAM-6769.
> 2. Let's document the current behavior of BQ IO w.r.t. handling bytes.
> Chances are that if we had such documentation, we wouldn't have had to
> answer questions raised in this thread. Filed BEAM-7326 to track.
> 3. Let's revise Python BQ integration tests to clearly communicate that BQ
> IO expects base64-encoded bytes.
> Filed BEAM-7327 to track.
>
> Coming back to the original message:
>
>> When writing b'abc' in python 2 this results in actually writing b'i\xb7'
>> which is the same as base64.b64decode('abc=')
>
> This is expected, as Beam BQ IO expects users to base64-encode their bytes.
>
>> When writing b'abc' in python 3 this results in “TypeError: b'abc' is not
>> JSON serializable”
>
> This is a Py3-compatibility bug. We should decode bytes to a str on Python
> 3. Given that we expect input to be base64-encoded, we can use the 'ascii'
> codec.
>
>> When writing b'\xab' in py2/py3 this gives a “ValueError: 'utf8' codec
>> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>> values are not JSON compliant
>
> This is expected, since b'\xab' cannot be base64 decoded.
>
>> When reading bytes from BQ they are currently returned as base-64 encoded
>> strings rather than the raw bytes.
>
> This is also expected.
>
> [0] https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
> [1] https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
> [2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
> [3] https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
> [4] https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
> [5] https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
> [6] https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
>
> On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada <[email protected]> wrote:
>
>> Sure, we can make users explicitly ask for schema autodetection, instead
>> of it being the default when no schema is provided.
>> I think that's reasonable.
>>
>> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> Thanks everyone for input on this thread. I think there is a confusion
>>> between not specifying the schema and asking BigQuery to do schema
>>> autodetection. These are not the same thing; however, in recent changes to
>>> BQ IO that happened after the 2.11 release, we are forcing schema
>>> autodetection when the schema is not specified, see: [1].
>>>
>>> I think we need to revise this ahead of 2.12. It may be better if users
>>> explicitly opt in to schema autodetection if they wish. Autodetection is an
>>> approximation, and in particular, as we figured out in this thread, it does
>>> not work correctly for BYTES data.
>>>
>>> I suspect that if we disable schema autodetection, and/or make the previous
>>> implementation of BQ sink a default option, we will be able to write BYTES
>>> data to a previously created BQ table without specifying the schema, and
>>> making a call to BQ to fetch the schema won't be necessary. We'd need to
>>> verify that.
>>>
>>> Another interesting note, as per Juta's analysis
>>> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
>>> google-cloud-bigquery client does not require additional base64 encoding
>>> for bytes, so once we migrate to use this client, base64 encoding/decoding
>>> of BYTES data won't be necessary in Beam.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
>>> [2]
>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>>>
>>> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <[email protected]>
>>> wrote:
>>>
>>>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]>
>>>> wrote:
>>>>
>>>>> +Chamikara Jayalath <[email protected]> with the new BigQuery
>>>>> sink, schema autodetection is supported (it's a very simple thing to
>>>>> have).
>>>>> Do you think we should not have it?
>>>>> Best
>>>>> -P.
>>>>
>>>> Ah, good to know. But IMO users should be able to write to existing
>>>> tables without specifying a schema (when CREATE_DISPOSITION is CREATE_NEVER,
>>>> for example). How do users enable schema auto-detection? Probably this
>>>> should not be enabled by default, and we should clearly advertise that the
>>>> bytes type is not supported (or support it with extra information). Just my
>>>> 2 cents.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]> wrote:
>>>>>>
>>>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> We received feedback on
>>>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers
>>>>>>>> say that schema identification is done and they discourage using
>>>>>>>> schema autodetection in tables using BYTES. In light of this, I think
>>>>>>>> it may be fair to recommend that Beam users specify BQ schemas as well
>>>>>>>> when they interact with BQ, and call out that writing binary data to
>>>>>>>> BQ will likely fail unless the schema is specified. Does that make
>>>>>>>> sense?
>>>>>>>
>>>>>>> Given that schema autodetect does not work for bytes, I think it is
>>>>>>> indeed a good solution to require users to specify BQ schemas as well
>>>>>>> when they write to BQ.
>>>>>>>
>>>>>>> So new summary:
>>>>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over
>>>>>>> the REST API. This will be a change in behavior for Python 2 (for good
>>>>>>> reasons).
>>>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>>>> base64-decoded.
>>>>>>> 3. Beam will send an API call to BigQuery to get the table schema
>>>>>>> whenever the schema is not supplied, to work around
>>>>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>>>>> users to specify the schema when writing bytes to BQ.
>>>>>>
>>>>>> I'm not sure why we reached this conclusion. We (Beam) do not use the
>>>>>> BQ schema auto detection feature currently. So why not just send an API
>>>>>> signal to get the schema when users are writing to existing tables?
>>>>>> Also, even if we decide to support schema auto detection in the future,
>>>>>> we will not be able to support this for the BYTES type (due to the
>>>>>> restriction by BQ).
>>>>>>
>>>>>>> Thanks all for your input on this!
>>>>>>> Juta
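For reference, the Python 3 "not JSON serializable" failure mentioned in the thread, and the way base64-encoding bytes before building a JSON payload avoids it, can be sketched with the standard library alone. The helper to_json_row and the row contents are hypothetical; this illustrates the idea behind point 1 of the summary quoted above and is not Beam's actual sink code.

```python
import base64
import json


def to_json_row(row: dict) -> str:
    """Hypothetical helper: serialize a row to JSON, base64-encoding any
    bytes values first so that they become plain ASCII strings."""
    safe = {
        key: base64.b64encode(value).decode("ascii")
        if isinstance(value, bytes)
        else value
        for key, value in row.items()
    }
    return json.dumps(safe, sort_keys=True)


row = {"id": 1, "payload": b"\xab"}

# On Python 3, json.dumps rejects raw bytes outright.
try:
    json.dumps(row)
except TypeError as err:
    print("plain json.dumps fails:", err)

# After base64-encoding, the row serializes cleanly.
print(to_json_row(row))
```

Decoding with the 'ascii' codec is safe here because base64 output only ever contains ASCII characters.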
