Sure, we can make users explicitly ask for schema autodetection, instead of
it being the default when no schema is provided. I think that's reasonable.
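
As a rough sketch of what explicit opt-in could look like: the helper name `resolve_load_settings` and the returned dict are hypothetical and are not the Beam API. The only thing taken from BigQuery is the idea that a load job carries a schema and an autodetect flag. With neither supplied, nothing is forced and the existing table's schema is left to apply.

```python
from typing import Optional


def resolve_load_settings(schema: Optional[dict] = None,
                          autodetect: bool = False) -> dict:
    """Hypothetical helper: decide what to send to BigQuery for a write.

    Autodetection is never implied by a missing schema; the caller must
    ask for it explicitly.
    """
    if schema is not None and autodetect:
        raise ValueError("Pass either a schema or autodetect=True, not both.")
    if schema is not None:
        # An explicit schema always wins.
        return {"schema": schema, "autodetect": False}
    if autodetect:
        # The user explicitly opted in to autodetection.
        return {"schema": None, "autodetect": True}
    # No schema and no opt-in: rely on the existing table's schema.
    return {"schema": None, "autodetect": False}
```

Under this sketch, writing to a previously created table with no schema argument would send neither a schema nor an autodetect request.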


On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]>
wrote:

> Thanks everyone for the input on this thread. I think there is some confusion
> between not specifying the schema and asking BigQuery to do schema
> autodetection. These are not the same thing; however, in recent changes to BQ
> IO that happened after the 2.11 release, we force schema autodetection
> when the schema is not specified, see [1].
>
> I think we need to revise this ahead of 2.12. It may be better if users
> explicitly opt in to schema autodetection if they wish. Autodetection is an
> approximation, and in particular, as we figured out in this thread, it does
> not work correctly for BYTES data.
>
> I suspect that if we disable schema autodetection, and/or make the previous
> implementation of the BQ sink the default option, we will be able to write
> BYTES data to a previously created BQ table without specifying the schema,
> and making a call to BQ to fetch the schema won't be necessary. We'd need to
> verify that.
>
> Another interesting note, as per Juta's analysis
> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
> google-cloud-bigquery client does not require additional base64 encoding
> for bytes, so once we migrate to use this client, base64 encoding/decoding
> of Bytes data won't be necessary in Beam.
>
> [1]
> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
> .
> [2]
> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>
> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>>
>>
>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]> wrote:
>>
>>> +Chamikara Jayalath <[email protected]> with the new BigQuery sink,
>>> schema autodetection is supported (it's a very simple thing to have). Do
>>> you think we should not have it?
>>> Best
>>> -P.
>>>
>>
>> Ah, good to know. But IMO users should be able to write to existing tables
>> without specifying a schema (when CREATE_DISPOSITION is CREATE_NEVER, for
>> example). How do users enable schema auto-detection? Probably this should
>> not be enabled by default and we should clearly advertise that the bytes type
>> is not supported (or support it with extra information). Just my 2 cents.
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]> wrote:
>>>>
>>>>>
>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We received feedback on
>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers say
>>>>>> that schema identification is done and they discourage using schema
>>>>>> autodetection in tables using BYTES. In light of this, I think it may be
>>>>>> fair to recommend that Beam users specify BQ schemas as well when they
>>>>>> interact with BQ, and call out that writing binary data to BQ will likely
>>>>>> fail unless a schema is specified. Does that make sense?
>>>>>>
>>>>>
>>>>> Given that schema autodetection does not work for bytes, I think it is
>>>>> indeed a good solution to require users to specify BQ schemas as well when
>>>>> they write to BQ.
>>>>>
>>>>> So new summary:
>>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over the
>>>>> REST API. This will be a change in behavior for Python 2 (for good
>>>>> reasons).
>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>> base64-decoded.
>>>>> 3. Beam will send an API call to BigQuery to get table schema,
>>>>> whenever schema is not supplied, to work around
>>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>>> users to specify the schema when writing bytes to BQ.
>>>>>
>>>>
>>>> I'm not sure why we reached this conclusion. We (Beam) do not use the BQ
>>>> schema auto-detection feature currently. So why not just send an API
>>>> call to get the schema when users are writing to existing tables? Also,
>>>> even if we decide to support schema auto-detection in the future, we will
>>>> not be able to support this for the BYTES type (due to the restriction by BQ).
>>>>
>>>>
>>>>> Thanks all for your input on this!
>>>>> Juta
>>>>>
>>>>>
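
Points 1 and 2 of the summary quoted above (base64-encode bytes on write, base64-decode BYTES fields on read) can be sketched roughly as follows. This is an illustration only: the function names and the `bytes_fields` argument are hypothetical, and the actual Beam implementation may be structured differently. It assumes rows are flat dicts and that the set of BYTES columns is known from the table schema.

```python
import base64
from typing import Any, Dict, Set


def encode_row_for_bq(row: Dict[str, Any]) -> Dict[str, Any]:
    """Base64-encode every raw bytes value so the row is JSON-safe."""
    return {
        key: base64.b64encode(value).decode("ascii")
        if isinstance(value, bytes) else value
        for key, value in row.items()
    }


def decode_row_from_bq(row: Dict[str, Any],
                       bytes_fields: Set[str]) -> Dict[str, Any]:
    """Base64-decode the fields that the table schema marks as BYTES."""
    return {
        key: base64.b64decode(value)
        if key in bytes_fields and value is not None else value
        for key, value in row.items()
    }
```

Decoding needs the schema (here reduced to `bytes_fields`) because an encoded BYTES value is indistinguishable from an ordinary string once it is in JSON form.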
