Sure, we can make users explicitly ask for schema autodetection, instead of it being the default when no schema is provided. I think that's reasonable.
On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev <[email protected]> wrote:

> Thanks everyone for input on this thread. I think there is a confusion
> between not specifying the schema, and asking BigQuery to do schema
> autodetection. This is not the same thing, however in recent changes to BQ
> IO that happened after 2.11 release, we are forcing schema autodetection,
> when schema is not specified, see: [1].
>
> I think we need to revise this ahead of 2.12. It may be better if users
> explicitly opt-in to schema autodetection if they wish. Autodetection is an
> approximation, and in particular, as we figured out in this thread, it does
> not work correctly for BYTES data.
>
> I suspect that if we disable schema autodetection, and/or make previous
> implementation of BQ sink a default option, we will be able to write BYTES
> data to a previously created BQ table without specifying the schema, and
> making a call to BQ to fetch the schema won't be necessary. We'd need to
> verify that.
>
> Another interesting note, as per Juta's analysis
> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
> google-cloud-bigquery client does not require additional base64 encoding
> for bytes, so once we migrate to use this client, base64 encoding/decoding
> of BYTES data won't be necessary in Beam.
>
> [1]
> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
> [2]
> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>
> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada <[email protected]> wrote:
>>
>>> +Chamikara Jayalath <[email protected]> with the new BigQuery sink,
>>> schema autodetection is supported (it's a very simple thing to have). Do
>>> you think we should not have it?
>>> Best
>>> -P.
>>>
>>
>> Ah good to know. But IMO users should be able to write to existing tables
>> without specifying a schema (when CREATE_DISPOSITION is CREATE_NEVER for
>> example). How do users enable schema auto-detection? Probably this should
>> not be enabled by default and we should clearly advertise that bytes type
>> is not supported (or support it with extra information). Just my 2 cents.
>>
>> Thanks,
>> Cham
>>
>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>> [email protected]> wrote:
>>>
>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes <[email protected]> wrote:
>>>>
>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We received feedback on
>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers say
>>>>>> that schema identification is done and they discourage using schema
>>>>>> autodetection in tables using BYTES. In light of this, I think it may
>>>>>> be fair to recommend Beam users to specify BQ schemas as well when
>>>>>> they interact with BQ, and call out that writing binary data to BQ
>>>>>> will likely fail unless schema is specified. Does that make sense?
>>>>>>
>>>>>
>>>>> Given that schema autodetect does not work for bytes I think it is
>>>>> indeed a good solution to require users to specify BQ schemas as well when
>>>>> they write to BQ.
>>>>>
>>>>> So new summary:
>>>>> 1. Beam will base64-encode raw bytes, before passing them to BQ over
>>>>> REST API. This will be a change in behavior for Python 2 (for good
>>>>> reasons).
>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>> base64-decoded.
>>>>> 3. Beam will send an API call to BigQuery to get table schema,
>>>>> whenever schema is not supplied, to work around
>>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>>> users to specify the schema when writing bytes to BQ.
>>>>>
>>>>
>>>> I'm not sure why we reached this conclusion. We (Beam) do not use BQ
>>>> schema auto detection feature currently. So why not just send an API
>>>> signal to get the schema when users are writing to existing tables? Also,
>>>> even if we decide to support schema auto detection in the future we will
>>>> not be able to support this for BYTES type (due to the restriction by BQ).
>>>>
>>>>> Thanks all for your input on this!
>>>>> Juta
>>>>>
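For reference, points 1 and 2 of the summary quoted above (base64-encode raw bytes before sending rows over the REST API, base64-decode BYTES fields when reading) could look roughly like the sketch below. This is only an illustration using the Python 3 standard library, not the actual Beam code; the helper names (encode_row, decode_row) and the bytes_fields argument are made up for the example, and it assumes the caller already knows which fields are BYTES (for instance from a user-supplied or fetched schema).

```python
import base64


def encode_row(row):
    """Return a copy of `row` with raw bytes values base64-encoded as text.

    Hypothetical helper: in Python 3, values of type `bytes` are turned into
    ASCII strings so the row can be serialized to JSON.
    """
    return {
        key: base64.b64encode(value).decode("ascii")
        if isinstance(value, bytes)
        else value
        for key, value in row.items()
    }


def decode_row(row, bytes_fields):
    """Return a copy of `row` with the named BYTES fields base64-decoded.

    `bytes_fields` is an assumption of this sketch: the set of field names
    whose type is BYTES according to the table schema.
    """
    return {
        key: base64.b64decode(value)
        if key in bytes_fields and value is not None
        else value
        for key, value in row.items()
    }


# Round trip: what is written is what is read back.
original = {"id": 1, "payload": b"\x00\xff"}
encoded = encode_row(original)
decoded = decode_row(encoded, bytes_fields={"payload"})
```

Decoding needs the schema because an encoded value is indistinguishable from an ordinary string on the wire, which is one reason the schema matters for BYTES columns.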
