I took a closer look at BigQuery IO implementation in Beam SDK and Dataflow
runner while reviewing a few PRs to address BEAM-6769, and I think we have
to revise the course of action here.
It turns out, that when we first added support for BYTES in Java BiqQuery
IO, we designed the API with an expectation that:
- On write path the user must pass base64-encoded bytes to the BQ IO. [0]
- On read path BQ IO base64-encodes the output result, before serving it to
the user. [1]
When support for BigQuery was added to Python SDK and Dataflow runner, the
runner authors preserved the behavior of treating bytes to be consistent
with Java BQ IO - bytes must be base64-encoded by the user, and bytes from
BQ IO returned by Dataflow Python runner are base64-encoded.
Unfortunately, this behavior is not documented in public documentation or
JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up until
we added integration tests a few years down the road [5,6]. Thanks to these
integration tests we discovered BEAM-6769.
I don't have context why we made a decision to avoid handling raw bytes in
Beam, however I think keeping consistent treatment of bytes across all SDKs
and runners is important for a smooth user experience, especially so when a
particular behavior is not documented well.
This being said I suggest the following:
1. Let's keep the current expectation that Beam operates only on
base64-encoded bytes in BQ IO. It may be reasonable to revise this
expectation, but it is beyond the scope of BEAM-6769.
2. Let's document current behavior of BQ IO w.r.t. of handling bytes.
Chances are that if we had such documentation, we wouldn't have had to
answer questions raised in this thread. Filed BEAM-7326 to track.
3. Let's revise Python BQ integration tests to clearly communicate that BQ
IO expects base64-encoded bytes. Filed BEAM-7327 to track.
Coming back to the original message:
When writing b’abc’ in python 2 this results in actually writing b'i\xb7'
> which is the same as base64.b64decode('abc='))
This is expected as Beam BQ IO expect users to base64-encode their bytes.
> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is not
> JSON serializable”
This is a Py3-compatibility bug. We should decode bytes to a str on Python
3. Given that we expect input to be base64-encoded, we can using 'ascii'
codec.
> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
> values are not JSON compliant
This expected since b’\xab’ cannot be base64 decoded.
> When reading bytes from BQ they are currently returned as base-64 encoded
> strings rather then the raw bytes.
This is also expected.
[0]
https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
[1]
https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
[2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
[3]
https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
[4]
https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
[5]
https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
[6]
https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada <[email protected]> wrote:
> Sure, we can make users explicitly ask for schema autodetection, instead
> of it being the default when no schema is provided. I think that's
> reasonable.
>
>
> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]>
> wrote:
>
>> Thanks everyone for input on this thread. I think there is a confusion
>> between not specifying the schema, and asking BigQuery to do schema
>> autodetection. This is not the same thing, however in recent changes to BQ
>> IO that happened after 2.11 release, we are forcing schema autodetection,
>> when schema is not specified, see: [1].
>>
>> I think we need to revise this ahead of 2.12. It may be better if users
>> explicitly opt-in to schema autodetection if they wish. Autodetection is an
>> approximation, and in particular, as we figured out in this thread, it does
>> not work correctly for BYTES data.
>>
>> I suspect that if we disable schema autodetection, and/or make previous
>> implementation of BQ sink a default option, we will be able to write BYTES
>> data to a previously created BQ table without specifying the schema, and
>> making a call to BQ to fetch the schema won't be necessary. We'd need to
>> verify that.
>>
>
>> Another interesting note, as per Juta's analysis
>> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
>> google-cloud-bigquery client does not require additional base64 encoding
>> for bytes, so once we migrate to use this client, base64 encoding/decoding
>> of Bytes data won't be necessary in Beam.
>>
>> [1]
>> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
>> .
>> [2]
>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>>
>> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]>
>>> wrote:
>>>
>>>> +Chamikara Jayalath <[email protected]> with the new BigQuery sink,
>>>> schema autodetection is supported (it's a very simple thing to have). Do
>>>> you think we should not have it?
>>>> Best
>>>> -P.
>>>>
>>>
>>> Ah good to know. But IMO users should be able to write to existing
>>> tables without specifying a schema (when CEATE_DISPOSITION is CREATE_NEVER
>>> for example). How do users enable schema auto-detection ? Probably this
>>> should not be enabled by default and we should clearly advertise that bytes
>>> type is not supported (or support it with extra information). Just my 2
>>> cents.
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> We received feedback on
>>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers
>>>>>>> say that schema identification is done and they discourage to use schema
>>>>>>> autodetection in tables using BYTES. In light of this, I think may be
>>>>>>> fair
>>>>>>> to recommend Beam users to specify BQ schemas as well when they interact
>>>>>>> with BQ, and call out that writing binary data to BQ will likely fail
>>>>>>> unless schema is specified. Does that make sense?
>>>>>>>
>>>>>>
>>>>>> Given that schema autodetect does not work for bytes I think it is
>>>>>> indeed a good solution to require users to specify BQ schemas as well
>>>>>> when
>>>>>> they write to BQ
>>>>>>
>>>>>> So new summary:
>>>>>> 1. Beam will base64-encode raw bytes, before passing them to BQ over
>>>>>> rest API. This will be a change in behavior for Python 2 (for good
>>>>>> reasons).
>>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>>> base64-decoded.
>>>>>> 3. Beam will send an API call to BigQuery to get table schema,
>>>>>> whenever schema is not supplied, to work around
>>>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>>>> users to specify the schema when writing bytes to BQ.
>>>>>>
>>>>>
>>>>> I'm not sure why we reached this conclusion. We (Beam) does not use BQ
>>>>> schema auto detection feature currently. So why not just send an API
>>>>> signal to get the schema when users are writing to existing tables ? Also,
>>>>> even if we decide to support schema auto detection in the future we will
>>>>> not be able to support this for BYTEs type (due to the restriction by BQ).
>>>>>
>>>>>
>>>>>> Thanks all for your input on this!
>>>>>> Juta
>>>>>>
>>>>>>