Thanks, everyone, for the input on this thread. I think there is some
confusion between not specifying a schema and asking BigQuery to do schema
autodetection. These are not the same thing. However, recent changes to BQ
IO, made after the 2.11 release, force schema autodetection whenever a
schema is not specified; see [1].

I think we need to revise this ahead of 2.12. It may be better if users
explicitly opt in to schema autodetection when they want it. Autodetection
is an approximation, and in particular, as we figured out in this thread, it
does not work correctly for BYTES data.
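
To make the opt-in idea concrete, here is a minimal sketch of how the load
configuration could be built so that autodetection is only requested when the
user asks for it. The helper name build_load_config and its arguments are
hypothetical and are not existing Beam code; the "schema" and "autodetect"
keys are meant to mirror the corresponding BigQuery load job settings.

```python
def build_load_config(schema=None, autodetect=False):
    """Build a load-job configuration dict (hypothetical helper, not Beam API).

    Autodetection is requested only when the caller opts in explicitly.
    With neither a schema nor autodetect, both keys are omitted so that an
    existing table's own schema would apply.
    """
    if schema is not None and autodetect:
        raise ValueError("Specify either a schema or autodetect=True, not both.")

    config = {"sourceFormat": "NEWLINE_DELIMITED_JSON"}
    if schema is not None:
        config["schema"] = schema
    elif autodetect:
        config["autodetect"] = True
    return config


# No schema and no opt-in: autodetection is not requested.
default_config = build_load_config()

# Explicit opt-in.
autodetect_config = build_load_config(autodetect=True)

# Explicit schema, as needed for BYTES columns.
bytes_schema = {"fields": [{"name": "payload", "type": "BYTES"}]}
schema_config = build_load_config(schema=bytes_schema)
```

Whether omitting both keys is enough for writes to an existing table is
exactly the part we would still need to verify.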

I suspect that if we disable schema autodetection, and/or make the previous
implementation of the BQ sink the default option, we will be able to write
BYTES data to a previously created BQ table without specifying the schema,
and a call to BQ to fetch the schema won't be necessary. We'd need to
verify that.

Another interesting note: as per Juta's analysis
<https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
the google-cloud-bigquery client does not require additional base64 encoding
for bytes, so once we migrate to this client, base64 encoding/decoding
of BYTES data won't be necessary in Beam.
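
For context, this is roughly the encoding step that is needed while rows are
sent as JSON. It is a sketch under stated assumptions, not the actual Beam
implementation: the helper names are hypothetical, and the schema is
simplified to a dict mapping field name to BigQuery type.

```python
import base64
import json


def encode_bytes_fields(row, schema):
    """Return a copy of row with BYTES fields base64-encoded as ASCII text."""
    encoded = dict(row)
    for name, field_type in schema.items():
        value = encoded.get(name)
        if field_type == "BYTES" and value is not None:
            encoded[name] = base64.b64encode(value).decode("ascii")
    return encoded


def decode_bytes_fields(row, schema):
    """Return a copy of row with BYTES fields base64-decoded back to bytes."""
    decoded = dict(row)
    for name, field_type in schema.items():
        value = decoded.get(name)
        if field_type == "BYTES" and value is not None:
            decoded[name] = base64.b64decode(value)
    return decoded


# Simplified schema representation (an assumption for this sketch).
schema = {"id": "INTEGER", "payload": "BYTES"}
row = {"id": 1, "payload": b"\xab\xac\xad"}

# Raw bytes are not JSON-serializable, so they are encoded before sending.
wire_row = encode_bytes_fields(row, schema)
wire_json = json.dumps(wire_row)

# On the read path, the same fields are decoded again.
round_tripped = decode_bytes_fields(json.loads(wire_json), schema)
```

Note that both directions need to know which fields are BYTES, which is why
the schema question above matters.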

[1]
https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
[2]
https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj

On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <[email protected]>
wrote:

>
>
> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]> wrote:
>
>> +Chamikara Jayalath <[email protected]> with the new BigQuery sink,
>> schema autodetection is supported (it's a very simple thing to have). Do
>> you think we should not have it?
>> Best
>> -P.
>>
>
> Ah, good to know. But IMO users should be able to write to existing tables
> without specifying a schema (when CREATE_DISPOSITION is CREATE_NEVER, for
> example). How do users enable schema auto-detection? Probably this should
> not be enabled by default, and we should clearly advertise that the bytes
> type is not supported (or support it with extra information). Just my 2 cents.
>
> Thanks,
> Cham
>
>
>>
>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]> wrote:
>>>
>>>>
>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <[email protected]>
>>>> wrote:
>>>>
>>>>> We received feedback on
>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers say
>>>>> that schema identification is done and they discourage using schema
>>>>> autodetection in tables using BYTES. In light of this, I think it may be
>>>>> fair to recommend that Beam users specify BQ schemas as well when they
>>>>> interact with BQ, and to call out that writing binary data to BQ will
>>>>> likely fail unless a schema is specified. Does that make sense?
>>>>>
>>>>
>>>> Given that schema autodetection does not work for bytes, I think it is
>>>> indeed a good solution to require users to specify BQ schemas as well when
>>>> they write to BQ.
>>>>
>>>> So new summary:
>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over
>>>> the REST API. This will be a change in behavior for Python 2 (for good
>>>> reasons).
>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>> base64-decoded.
>>>> 3. Beam will send an API call to BigQuery to get the table schema
>>>> whenever a schema is not supplied, to work around
>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>> users to specify the schema when writing bytes to BQ.
>>>>
>>>
>>> I'm not sure why we reached this conclusion. We (Beam) do not currently
>>> use the BQ schema auto-detection feature. So why not just send an API
>>> call to get the schema when users are writing to existing tables? Also,
>>> even if we decide to support schema auto-detection in the future, we will
>>> not be able to support it for the BYTES type (due to the restriction by BQ).
>>>
>>>
>>>> Thanks all for your input on this!
>>>> Juta
>>>>
>>>>
