Re: Writing bytes to BigQuery with beam

Robert Burke Wed, 15 May 2019 13:35:07 -0700

For the Go SDK:
BigQueryIO
<https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/io/bigqueryio>
exists,
but other than maybe one PR that added batching of writes (to avoid the
size limit communicating with BigQuery), the reads are probably going to be
re-written
I don't believe there's any special handling of base64 bytes by the IO
code. Users pass in their types, and the assumption is they use BigQuery
compatible Go schemas. eg. tornadoes example
<https://github.com/apache/beam/blob/master/sdks/go/examples/cookbook/tornadoes/tornadoes.go#L60>
,


On Wed, 15 May 2019 at 12:41, Valentyn Tymofieiev <[email protected]>
wrote:

> By the way, does anyone know what is the status of BigQuery connector in
> Beam Go and Beam SQL? Perhaps some folks working on these SDKs can chime in
> here.
> I am curious whether these SDKs also make / will make it a responsibility
> of the user to base64-encode bytes. As I mentioned above, it is desirable
> to have a consistent UX across SDK, especially given that we are working on
> adding support for cross-language pipelines (
> https://beam.apache.org/roadmap/connectors-multi-sdk/).
>
> On Wed, May 15, 2019 at 12:26 PM Valentyn Tymofieiev <[email protected]>
> wrote:
>
>> I took a closer look at BigQuery IO implementation in Beam SDK and
>> Dataflow runner while reviewing a few PRs to address BEAM-6769, and I think
>> we have to revise the course of action here.
>>
>> It turns out, that when we first added support for BYTES in Java BiqQuery
>> IO, we designed the API with an expectation that:
>> - On write path the user must pass base64-encoded bytes to the BQ IO. [0]
>> - On read path BQ IO base64-encodes the output result, before serving it
>> to the user. [1]
>>
>> When support for BigQuery was added to Python SDK and Dataflow runner,
>> the runner authors preserved the behavior of treating bytes to be
>> consistent with Java BQ IO - bytes must be base64-encoded by the user, and
>> bytes from BQ IO returned by Dataflow Python runner are base64-encoded.
>>
>> Unfortunately, this behavior is not documented in public documentation or
>> JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up until
>> we added integration tests a few years down the road [5,6]. Thanks to these
>> integration tests we discovered BEAM-6769.
>>
>> I don't have context why we made a decision to avoid handling raw bytes
>> in Beam, however I think keeping consistent treatment of bytes across all
>> SDKs and runners is important for a smooth user experience, especially so
>> when a particular behavior is not documented well.
>>
>> This being said I suggest the following:
>> 1. Let's keep the current expectation that Beam operates only on
>> base64-encoded bytes in BQ IO. It may be reasonable to revise this
>> expectation, but it is beyond the scope of  BEAM-6769.
>> 2. Let's document current behavior of BQ IO w.r.t. of handling bytes.
>> Chances are that if we had such documentation, we wouldn't have had to
>> answer questions raised in this thread. Filed BEAM-7326 to track.
>> 3. Let's revise Python BQ integration tests to clearly communicate that
>> BQ IO expects base64-encoded bytes. Filed BEAM-7327 to track.
>>
>> Coming back to the original message:
>>
>> When writing b’abc’ in python 2 this results in actually writing b'i\xb7'
>>> which is the same as base64.b64decode('abc='))
>>
>> This is expected as Beam BQ IO expect users to base64-encode their bytes.
>>
>>> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is
>>> not JSON serializable”
>>
>> This is a Py3-compatibility bug. We should decode bytes to a str on
>> Python 3. Given that we expect input to be base64-encoded, we can using
>> 'ascii' codec.
>>
>>> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
>>> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>>> values are not JSON compliant
>>
>> This expected since b’\xab’ cannot be base64 decoded.
>>
>>> When reading bytes from BQ they are currently returned as base-64
>>> encoded strings rather then the raw bytes.
>>
>> This is also expected.
>>
>> [0]
>> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
>>
>> [1]
>> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
>> [2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
>> [3]
>> https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
>> [4]
>> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
>> [5]
>> https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
>>
>> [6]
>> https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
>>
>>
>> On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada <[email protected]>
>> wrote:
>>
>>> Sure, we can make users explicitly ask for schema autodetection, instead
>>> of it being the default when no schema is provided. I think that's
>>> reasonable.
>>>
>>>
>>> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]>
>>> wrote:
>>>
>>>> Thanks everyone for input on this thread. I think there is a confusion
>>>> between not specifying the schema, and asking BigQuery to do schema
>>>> autodetection. This is not the same thing, however in recent changes to BQ
>>>> IO that happened after 2.11 release, we are forcing schema autodetection,
>>>> when schema is not specified, see: [1].
>>>>
>>>> I think we need to revise this ahead of 2.12. It may be better if users
>>>> explicitly opt-in to schema autodetection if they wish. Autodetection is an
>>>> approximation, and in particular, as we figured out in this thread, it does
>>>> not work correctly for BYTES data.
>>>>
>>>> I suspect that if we disable schema autodetection, and/or make previous
>>>> implementation of BQ sink a default option, we will be able to write BYTES
>>>> data to a previously created BQ table without specifying the schema, and
>>>> making a call to BQ to fetch the schema won't be necessary. We'd need to
>>>> verify that.
>>>>
>>>
>>>> Another interesting note, as per Juta's analysis
>>>> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
>>>> google-cloud-bigquery client does not require additional base64 encoding
>>>> for bytes, so once we migrate to use this client, base64 encoding/decoding
>>>> of Bytes data won't be necessary in Beam.
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
>>>> .
>>>> [2]
>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>>>>
>>>> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +Chamikara Jayalath <[email protected]> with the new BigQuery
>>>>>> sink, schema autodetection is supported (it's a very simple thing to 
>>>>>> have).
>>>>>> Do you think we should not have it?
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>
>>>>> Ah good to know. But IMO users should be able to write to existing
>>>>> tables without specifying a schema (when CEATE_DISPOSITION is CREATE_NEVER
>>>>> for example). How do users enable schema auto-detection ? Probably this
>>>>> should not be enabled by default and we should clearly advertise that 
>>>>> bytes
>>>>> type is not supported (or support it with extra information). Just my 2
>>>>> cents.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> We received feedback on
>>>>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers
>>>>>>>>> say that schema identification is done and they discourage to use 
>>>>>>>>> schema
>>>>>>>>> autodetection in tables using BYTES. In light of this, I think may be 
>>>>>>>>> fair
>>>>>>>>> to recommend Beam users to specify BQ schemas as well when they 
>>>>>>>>> interact
>>>>>>>>> with BQ, and call out that writing binary data to BQ will likely fail
>>>>>>>>> unless schema is specified. Does that make sense?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Given that schema autodetect does not work for bytes I think it is
>>>>>>>> indeed a good solution to require users to specify BQ schemas as well 
>>>>>>>> when
>>>>>>>> they write to BQ
>>>>>>>>
>>>>>>>> So new summary:
>>>>>>>> 1. Beam will base64-encode raw bytes, before passing them to BQ
>>>>>>>> over rest API. This will be a change in behavior for Python 2 (for good
>>>>>>>> reasons).
>>>>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>>>>> base64-decoded.
>>>>>>>> 3. Beam will send an API call to BigQuery to get table schema,
>>>>>>>> whenever schema is not supplied, to work around
>>>>>>>> https://issuetracker.google.com/issues/129006689. Beam will
>>>>>>>> require users to specify the schema when writing bytes to BQ.
>>>>>>>>
>>>>>>>
>>>>>>> I'm not sure why we reached this conclusion. We (Beam) does not use
>>>>>>> BQ schema auto detection feature currently.  So why not just send an API
>>>>>>> signal to get the schema when users are writing to existing tables ? 
>>>>>>> Also,
>>>>>>> even if we decide to support schema auto detection in the future we will
>>>>>>> not be able to support this for BYTEs type (due to the restriction by 
>>>>>>> BQ).
>>>>>>>
>>>>>>>
>>>>>>>> Thanks all for your input on this!
>>>>>>>> Juta
>>>>>>>>
>>>>>>>>

Re: Writing bytes to BigQuery with beam

Reply via email to