Re: Writing bytes to BigQuery with beam

Valentyn Tymofieiev Thu, 16 May 2019 16:45:09 -0700

Also, I filed https://issues.apache.org/jira/browse/BEAM-7346 to add more
tests to Go SDK and verify the consistency of BQ IO behavior w.r.t.
handling BYTES.


On Thu, May 16, 2019 at 4:42 PM Valentyn Tymofieiev <[email protected]>
wrote:

>
> On Thu, May 16, 2019 at 1:12 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>>
>>
>> On Wed, May 15, 2019 at 12:26 PM Valentyn Tymofieiev <[email protected]>
>> wrote:
>>
>>> I took a closer look at BigQuery IO implementation in Beam SDK and
>>> Dataflow runner while reviewing a few PRs to address BEAM-6769, and I think
>>> we have to revise the course of action here.
>>>
>>> It turns out, that when we first added support for BYTES in Java
>>> BiqQuery IO, we designed the API with an expectation that:
>>> - On write path the user must pass base64-encoded bytes to the BQ IO.
>>> [0]
>>> - On read path BQ IO base64-encodes the output result, before serving it
>>> to the user. [1]
>>>
>>> When support for BigQuery was added to Python SDK and Dataflow runner,
>>> the runner authors preserved the behavior of treating bytes to be
>>> consistent with Java BQ IO - bytes must be base64-encoded by the user, and
>>> bytes from BQ IO returned by Dataflow Python runner are base64-encoded.
>>>
>>> Unfortunately, this behavior is not documented in public documentation
>>> or JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up
>>> until we added integration tests a few years down the road [5,6]. Thanks to
>>> these integration tests we discovered BEAM-6769.
>>>
>>> I don't have context why we made a decision to avoid handling raw bytes
>>> in Beam, however I think keeping consistent treatment of bytes across all
>>> SDKs and runners is important for a smooth user experience, especially so
>>> when a particular behavior is not documented well.
>>>
>>> This being said I suggest the following:
>>> 1. Let's keep the current expectation that Beam operates only on
>>> base64-encoded bytes in BQ IO. It may be reasonable to revise this
>>> expectation, but it is beyond the scope of  BEAM-6769.
>>> 2. Let's document current behavior of BQ IO w.r.t. of handling bytes.
>>> Chances are that if we had such documentation, we wouldn't have had to
>>> answer questions raised in this thread. Filed BEAM-7326 to track.
>>> 3. Let's revise Python BQ integration tests to clearly communicate that
>>> BQ IO expects base64-encoded bytes. Filed BEAM-7327 to track.
>>>
>>
>> +1 for documentation updates at least.
>> We probably will able to add support for writing both base64 encoded
>> bytes and row bytes without breaking backwards compatibility though. For
>> reading, we can add an option if we want to add support for automatically
>> decoding base64 encoded bytes.
>>
> Filed https://issues.apache.org/jira/browse/BEAM-7344 to consider this.
>
>>
>>> Coming back to the original message:
>>>
>>> When writing b’abc’ in python 2 this results in actually writing
>>>> b'i\xb7' which is the same as base64.b64decode('abc='))
>>>
>>> This is expected as Beam BQ IO expect users to base64-encode their bytes.
>>>
>>
>> Does BigQuery still store data as bytes (if the type of the corresponding
>> field is BYTES) ?
>>
>
> Yes, I believe BigQuery stores data as raw BYTES, but in the BQ UI / bq
> shell bytes are visualized in base64-encoded form.
>
>
>> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is not
>>>> JSON serializable”
>>>
>>> This is a Py3-compatibility bug. We should decode bytes to a str on
>>> Python 3. Given that we expect input to be base64-encoded, we can using
>>> 'ascii' codec.
>>>
>>
>> Is this error coming from json library ? We write data to JSON files
>> before exporting to BQ and JSON cannot handle raw bytes. We can probably
>> convert bytes to base64 before writing JSON.
>>
>
> I think it's coming from json.dumps() which does not accept bytes, a
> decode to a str should fix this.
>
>
>> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
>>>> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>>>> values are not JSON compliant
>>>
>>> This expected since b’\xab’ cannot be base64 decoded.
>>>
>>
>> This is definitely from json library.
>>
>>
>>> When reading bytes from BQ they are currently returned as base-64
>>>> encoded strings rather then the raw bytes.
>>>
>>> This is also expected.
>>>
>>
>> We can add an option to support reading raw bytes directly as mentioned
>> above.
>>
> Feel free to comment on BEAM-7344.
>
>>
>>
>>>
>>> [0]
>>> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
>>>
>>> [1]
>>> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
>>> [2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
>>> [3]
>>> https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
>>> [4]
>>> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
>>> [5]
>>> https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
>>>
>>> [6]
>>> https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
>>>
>>>
>>> On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada <[email protected]>
>>> wrote:
>>>
>>>> Sure, we can make users explicitly ask for schema autodetection,
>>>> instead of it being the default when no schema is provided. I think that's
>>>> reasonable.
>>>>
>>>>
>>>> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks everyone for input on this thread. I think there is a confusion
>>>>> between not specifying the schema, and asking BigQuery to do schema
>>>>> autodetection. This is not the same thing, however in recent changes to BQ
>>>>> IO that happened after 2.11 release, we are forcing schema autodetection,
>>>>> when schema is not specified, see: [1].
>>>>>
>>>>> I think we need to revise this ahead of 2.12. It may be better if
>>>>> users explicitly opt-in to schema autodetection if they wish. 
>>>>> Autodetection
>>>>> is an approximation, and in particular, as we figured out in this thread,
>>>>> it does not work correctly for BYTES data.
>>>>>
>>>>> I suspect that if we disable schema autodetection, and/or make
>>>>> previous implementation of BQ sink a default option, we will be able to
>>>>> write BYTES data to a previously created BQ table without specifying the
>>>>> schema, and making a call to BQ to fetch the schema won't be necessary.
>>>>> We'd need to verify that.
>>>>>
>>>>
>>>>> Another interesting note, as per Juta's analysis
>>>>> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
>>>>> google-cloud-bigquery client does not require additional base64 encoding
>>>>> for bytes, so once we migrate to use this client, base64 encoding/decoding
>>>>> of Bytes data won't be necessary in Beam.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
>>>>> .
>>>>> [2]
>>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>>>>>
>>>>> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +Chamikara Jayalath <[email protected]> with the new BigQuery
>>>>>>> sink, schema autodetection is supported (it's a very simple thing to 
>>>>>>> have).
>>>>>>> Do you think we should not have it?
>>>>>>> Best
>>>>>>> -P.
>>>>>>>
>>>>>>
>>>>>> Ah good to know. But IMO users should be able to write to existing
>>>>>> tables without specifying a schema (when CEATE_DISPOSITION is 
>>>>>> CREATE_NEVER
>>>>>> for example). How do users enable schema auto-detection ? Probably this
>>>>>> should not be enabled by default and we should clearly advertise that 
>>>>>> bytes
>>>>>> type is not supported (or support it with extra information). Just my 2
>>>>>> cents.
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We received feedback on
>>>>>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers
>>>>>>>>>> say that schema identification is done and they discourage to use 
>>>>>>>>>> schema
>>>>>>>>>> autodetection in tables using BYTES. In light of this, I think may 
>>>>>>>>>> be fair
>>>>>>>>>> to recommend Beam users to specify BQ schemas as well when they 
>>>>>>>>>> interact
>>>>>>>>>> with BQ, and call out that writing binary data to BQ will likely fail
>>>>>>>>>> unless schema is specified. Does that make sense?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Given that schema autodetect does not work for bytes I think it is
>>>>>>>>> indeed a good solution to require users to specify BQ schemas as well 
>>>>>>>>> when
>>>>>>>>> they write to BQ
>>>>>>>>>
>>>>>>>>> So new summary:
>>>>>>>>> 1. Beam will base64-encode raw bytes, before passing them to BQ
>>>>>>>>> over rest API. This will be a change in behavior for Python 2 (for 
>>>>>>>>> good
>>>>>>>>> reasons).
>>>>>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>>>>>> base64-decoded.
>>>>>>>>> 3. Beam will send an API call to BigQuery to get table schema,
>>>>>>>>> whenever schema is not supplied, to work around
>>>>>>>>> https://issuetracker.google.com/issues/129006689. Beam will
>>>>>>>>> require users to specify the schema when writing bytes to BQ.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure why we reached this conclusion. We (Beam) does not use
>>>>>>>> BQ schema auto detection feature currently.  So why not just send an 
>>>>>>>> API
>>>>>>>> signal to get the schema when users are writing to existing tables ? 
>>>>>>>> Also,
>>>>>>>> even if we decide to support schema auto detection in the future we 
>>>>>>>> will
>>>>>>>> not be able to support this for BYTEs type (due to the restriction by 
>>>>>>>> BQ).
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks all for your input on this!
>>>>>>>>> Juta
>>>>>>>>>
>>>>>>>>>

Re: Writing bytes to BigQuery with beam

Reply via email to