On Wed, May 15, 2019 at 12:26 PM Valentyn Tymofieiev <[email protected]>
wrote:

> I took a closer look at BigQuery IO implementation in Beam SDK and
> Dataflow runner while reviewing a few PRs to address BEAM-6769, and I think
> we have to revise the course of action here.
>
> It turns out that when we first added support for BYTES in the Java BigQuery
> IO, we designed the API with the expectation that:
> - On the write path, the user must pass base64-encoded bytes to the BQ IO. [0]
> - On the read path, BQ IO base64-encodes the output result before serving it
> to the user. [1]
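>
> For illustration, a minimal sketch of this contract from the user's side
> (field names and values are made up):
>
>     import base64
>
>     # Write path: the user base64-encodes raw bytes before handing the
>     # row to BQ IO.
>     row_to_write = {'photo': base64.b64encode(b'\x00\xff\x10').decode('ascii')}
>
>     # Read path: BQ IO serves the BYTES field base64-encoded, so the
>     # user decodes it to recover the raw bytes.
>     row_read = {'photo': 'AP8Q'}  # as returned by BQ IO
>     assert base64.b64decode(row_read['photo']) == b'\x00\xff\x10'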
>
> When support for BigQuery was added to the Python SDK and Dataflow runner,
> the runner authors preserved the bytes handling of the Java BQ IO for
> consistency: bytes must be base64-encoded by the user, and bytes read from
> BQ IO by the Dataflow Python runner are returned base64-encoded.
>
> Unfortunately, this behavior is not documented in the public documentation
> or in the JavaDoc/PyDocs [2-4], and there were no examples illustrating it
> until we added integration tests a few years down the road [5,6]. Thanks to
> these integration tests we discovered BEAM-6769.
>
> I don't have context on why we decided to avoid handling raw bytes in
> Beam; however, I think consistent treatment of bytes across all SDKs and
> runners is important for a smooth user experience, especially when a
> particular behavior is not well documented.
>
> That being said, I suggest the following:
> 1. Let's keep the current expectation that Beam operates only on
> base64-encoded bytes in BQ IO. It may be reasonable to revise this
> expectation, but that is beyond the scope of BEAM-6769.
> 2. Let's document the current behavior of BQ IO w.r.t. handling of bytes.
> Chances are that if we had such documentation, we wouldn't have had to
> answer the questions raised in this thread. Filed BEAM-7326 to track.
> 3. Let's revise the Python BQ integration tests to clearly communicate that
> BQ IO expects base64-encoded bytes. Filed BEAM-7327 to track.
>

+1 for documentation updates at least.
We will probably be able to add support for writing both base64-encoded
bytes and raw bytes without breaking backwards compatibility, though. For
reading, we can add an option if we want to support automatically decoding
base64-encoded bytes.
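
As a rough sketch, the write side could coerce both forms with a helper
like the following (hypothetical, not existing Beam code; it assumes that
on Python 3 callers pass base64 text as str and raw data as bytes):

    import base64

    def _coerce_bytes_field(value):
        # Hypothetical helper: encode raw bytes here, and pass
        # already-encoded base64 text (str) through unchanged, which
        # keeps existing pipelines working.
        if isinstance(value, bytes):
            return base64.b64encode(value).decode('ascii')
        return value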


> Coming back to the original message:
>
> When writing b’abc’ in python 2 this results in actually writing b'i\xb7'
>> which is the same as base64.b64decode('abc=')
>
> This is expected, as Beam BQ IO expects users to base64-encode their bytes.
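>
> Concretely, with just the standard library:
>
>     import base64
>
>     # b'abc' is treated as base64 text; padded to 'abc=', it decodes to
>     # the two raw bytes 0x69 0xb7, i.e. b'i\xb7'.
>     assert base64.b64decode(b'abc=') == b'i\xb7'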
>

Does BigQuery still store the data as raw bytes (if the type of the
corresponding field is BYTES)?


> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is not
>> JSON serializable”
>
> This is a Py3-compatibility bug. We should decode bytes to a str on Python
> 3. Given that we expect the input to be base64-encoded, we can use the
> 'ascii' codec.
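>
> A minimal illustration of the failure and the fix:
>
>     import json
>
>     payload = b'abc'  # base64 text, held as bytes
>     # json.dumps(payload) raises TypeError on Python 3; decoding with
>     # the 'ascii' codec is safe because valid base64 is pure ASCII.
>     json.dumps(payload.decode('ascii'))  # returns '"abc"'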
>

Is this error coming from the json library? We write data to JSON files before
exporting to BQ and JSON cannot handle raw bytes. We can probably convert
bytes to base64 before writing JSON.
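
A sketch of that conversion (raw bytes in, JSON-safe base64 text out):

    import base64
    import json

    raw = b'\xab\xcd'
    # Encode raw bytes as base64 text so the row is JSON-serializable.
    json.dumps(base64.b64encode(raw).decode('ascii'))  # returns '"q80="'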


> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
>> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>> values are not JSON compliant
>
> This is expected, since b’\xab’ cannot be base64-decoded.
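>
> Both failure modes are easy to reproduce (Python 3 shown):
>
>     import base64
>     import binascii
>
>     try:
>         base64.b64decode(b'\xab', validate=True)
>     except binascii.Error:
>         pass  # 0xab is not in the base64 alphabet
>
>     try:
>         b'\xab'.decode('utf-8')
>     except UnicodeDecodeError:
>         pass  # nor is it valid UTF-8, hence the error on the json path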
>

This is definitely from the json library.


>> When reading bytes from BQ they are currently returned as base64-encoded
>> strings rather than the raw bytes.
>
> This is also expected.
>

We can add an option to support reading raw bytes directly, as mentioned
above.
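
Until then, users can decode explicitly after the read; a minimal sketch
(the field name 'payload' is hypothetical, and 'rows' stands for a
PCollection of dicts read from BigQuery):

    import base64
    import apache_beam as beam

    def decode_bytes_field(row):
        # Decode the base64 text that BQ IO currently returns for a
        # BYTES column back into raw bytes.
        row['payload'] = base64.b64decode(row['payload'])
        return row

    decoded = rows | beam.Map(decode_bytes_field)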


>
> [0]
> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
>
> [1]
> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
> [2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
> [3]
> https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
> [4]
> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
> [5]
> https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
>
> [6]
> https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
>
>
> On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada <[email protected]> wrote:
>
>> Sure, we can make users explicitly ask for schema autodetection, instead
>> of it being the default when no schema is provided. I think that's
>> reasonable.
>>
>>
>> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> Thanks everyone for the input on this thread. I think there is confusion
>>> between not specifying the schema and asking BigQuery to do schema
>>> autodetection. These are not the same thing; however, in recent changes to
>>> BQ IO that happened after the 2.11 release, we force schema autodetection
>>> when the schema is not specified, see [1].
>>>
>>> I think we need to revise this ahead of 2.12. It may be better if users
>>> explicitly opt in to schema autodetection when they wish to use it.
>>> Autodetection is an approximation and, in particular, as we figured out in
>>> this thread, it does not work correctly for BYTES data.
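>>>
>>> In the meantime, writers can sidestep autodetection by always passing a
>>> schema; for example (table and field names are made up, and 'rows' is the
>>> PCollection to write):
>>>
>>>     rows | beam.io.WriteToBigQuery(
>>>         'my-project:my_dataset.my_table',
>>>         schema='payload:BYTES')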
>>>
>>> I suspect that if we disable schema autodetection, and/or make the
>>> previous implementation of the BQ sink the default option, we will be able
>>> to write BYTES data to a previously created BQ table without specifying
>>> the schema, and making a call to BQ to fetch the schema won't be
>>> necessary. We'd need to verify that.
>>>
>>
>>> Another interesting note, as per Juta's analysis
>>> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
>>> the google-cloud-bigquery client does not require additional base64
>>> encoding for bytes, so once we migrate to that client, base64
>>> encoding/decoding of BYTES data won't be necessary in Beam.
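>>>
>>> If that holds, passing raw bytes becomes direct; a sketch of the
>>> client-side behavior described in the analysis (table and field names are
>>> made up):
>>>
>>>     from google.cloud import bigquery
>>>
>>>     client = bigquery.Client()
>>>     table = client.get_table('my-project.my_dataset.my_table')
>>>     # The client accepts raw bytes for a BYTES field and handles the
>>>     # wire encoding itself.
>>>     errors = client.insert_rows(table, [{'payload': b'\xab\xcd'}])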
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
>>> [2]
>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>>>
>>> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <[email protected]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]>
>>>> wrote:
>>>>
>>>>> +Chamikara Jayalath <[email protected]> With the new BigQuery
>>>>> sink, schema autodetection is supported (it's a very simple thing to
>>>>> have). Do you think we should not have it?
>>>>> Best
>>>>> -P.
>>>>>
>>>>
Ah, good to know. But IMO users should be able to write to existing tables
without specifying a schema (when CREATE_DISPOSITION is CREATE_NEVER, for
example). How do users enable schema auto-detection? Probably this should
not be enabled by default, and we should clearly advertise that the BYTES
type is not supported (or support it with extra information). Just my 2
cents.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>>>
>>>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> We received feedback on
>>>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers
>>>>>>>> say that this is how schema identification is done, and they
>>>>>>>> discourage using schema autodetection for tables containing BYTES. In
>>>>>>>> light of this, I think it may be fair to recommend that Beam users
>>>>>>>> specify BQ schemas when they interact with BQ, and to call out that
>>>>>>>> writing binary data to BQ will likely fail unless a schema is
>>>>>>>> specified. Does that make sense?
>>>>>>>>
>>>>>>>
>>>>>>> Given that schema autodetection does not work for bytes, I think it
>>>>>>> is indeed a good solution to require users to specify BQ schemas when
>>>>>>> they write to BQ.
>>>>>>>
>>>>>>> So, the new summary:
>>>>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over
>>>>>>> the REST API. This will be a change in behavior for Python 2 (for good
>>>>>>> reasons).
>>>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>>>> base64-decoded.
>>>>>>> 3. Beam will send an API call to BigQuery to get the table schema
>>>>>>> whenever a schema is not supplied, to work around
>>>>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>>>>> users to specify the schema when writing bytes to BQ.
>>>>>>>
>>>>>>
>>>>>> I'm not sure why we reached this conclusion. We (Beam) do not use the
>>>>>> BQ schema auto-detection feature currently. So why not just send an API
>>>>>> call to get the schema when users are writing to existing tables? Also,
>>>>>> even if we decide to support schema auto-detection in the future, we
>>>>>> will not be able to support it for the BYTES type (due to the
>>>>>> restriction by BQ).
>>>>>>
>>>>>>
>>>>>>> Thanks all for your input on this!
>>>>>>> Juta
>>>>>>>
>>>>>>>
