On Wed, Mar 20, 2019 at 7:37 PM Valentyn Tymofieiev <[email protected]> wrote:
> Pablo, according to Juta's analysis (1.c in the document) and also
> https://issuetracker.google.com/issues/129006689, I think BQ confuses
> BYTES and STRING when the schema is not specified... This seems to me
> like a BQ bug, so for Beam this means that we either have to wait until
> BQ fixes it, or work around it. If we work around it, we can ask users
> to always supply a schema if their table has BYTES data (temporary
> limitation), or try to pull the schema from BQ before (every?) write
> operation.
>
> Cham, according to BQ documentation, BQ *can* auto-detect the schema
> when populating new tables from a data source, for example a JSON file
> with records: https://cloud.google.com/bigquery/docs/schema-detect.

Ah, we don't support that AFAIK, so currently we require users to provide
a schema to create tables. But good point, in case we ever want to support
that feature.

> On Wed, Mar 20, 2019 at 7:15 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>> On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada <[email protected]> wrote:
>>
>>> That sounds reasonable to me, Valentyn.
>>>
>>> Regarding (3), when the table already exists, it's not necessary to
>>> get the schema. BQ is smart enough to load everything appropriately
>>> (as long as bytes fields are base64-encoded).
>>>
>>> The problem is when the table does not exist and the user does not
>>> provide a schema. In that case, there is no simple way of
>>> auto-inferring the schema, as you correctly point out. I think it's
>>> reasonable to simply expect users to provide schemas if their data
>>> has types that are tricky to infer.
>>> Best
>>> -P.
>>
>> Is this even an option? I think when the table is not available, users
>> have to provide a schema to create a new table.
>>
>>> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev
>>> <[email protected]> wrote:
>>>
>>>> Thanks, Juta, for the detailed analysis.
>>>>
>>>> I reached out to the BigQuery team to improve documentation around
>>>> the treatment of BYTES and reported the issue that schema
>>>> autodetection does not work
>>>> <https://issuetracker.google.com/issues/129006689> for BYTES in the
>>>> GCP issue tracker
>>>> <https://cloud.google.com/support/docs/issue-trackers>.
>>>>
>>>> Is this a correct summary of your proposal?
>>>>
>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over
>>>> the REST API. This will be a change in behavior for Python 2 (for
>>>> good reasons).
>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>> base64-decoded.
>>>> 3. Beam will send an API call to BigQuery to get the table schema
>>>> whenever a schema is not supplied, to work around
>>>> https://issuetracker.google.com/issues/129006689. Does anyone see
>>>> any concerns with this? Is it always possible?
>>>>
>>>> Thanks,
>>>> Valentyn
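
(For concreteness, a minimal sketch of what points 1 and 2 could look
like on the Python side; the helper names are illustrative, not actual
Beam internals:)

    import base64

    # Illustrative sketch, not the actual Beam implementation.

    def encode_bytes_for_bq(value):
        # Point 1 (write path): base64-encode raw bytes before they are
        # placed into the JSON payload of the BQ REST request.
        if isinstance(value, bytes):
            return base64.b64encode(value).decode('ascii')
        return value

    def decode_bytes_from_bq(value):
        # Point 2 (read path): fields whose schema type is BYTES come
        # back as base-64 strings; decode them so users get raw bytes.
        return base64.b64decode(value)

    # Round trip: arbitrary bytes survive the write/read cycle.
    assert decode_bytes_from_bq(encode_bytes_for_bq(b'\xab')) == b'\xab'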
>>>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> The Java SDK relies on Jackson to do the encoding.
>>>>>
>>>>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am working on porting Beam to Python 3 and discovered the
>>>>>>> following:
>>>>>>>
>>>>>>> Current handling of bytes in BigQuery IO:
>>>>>>>
>>>>>>> When writing bytes to BQ, Beam uses
>>>>>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This
>>>>>>> API expects byte values to be base-64 encoded*.
>>>>>>>
>>>>>>> However, when writing raw bytes they are currently never
>>>>>>> transformed into base-64 encoded strings. This results in the
>>>>>>> following errors:
>>>>>>>
>>>>>>> - When writing b'abc' in Python 2, this actually writes b'i\xb7',
>>>>>>> which is the same as base64.b64decode('abc=').
>>>>>>> - When writing b'abc' in Python 3, this results in "TypeError:
>>>>>>> b'abc' is not JSON serializable".
>>>>>>> - When writing b'\xab' in py2/py3, this gives a "ValueError:
>>>>>>> 'utf8' codec can't decode byte 0xab in position 0: invalid start
>>>>>>> byte. NAN, INF and -INF values are not JSON compliant".
>>>>>>> - When reading bytes from BQ, they are currently returned as
>>>>>>> base-64 encoded strings rather than the raw bytes.
>>>>>>>
>>>>>>> Example code:
>>>>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
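
(The failures listed above are easy to reproduce outside Beam. A quick
illustration, with the Python 2 outcome noted in the comments:)

    import base64
    import json

    # Illustrative repro, outside of Beam.

    # Python 3: raw bytes cannot be placed in a JSON request body at all.
    try:
        json.dumps({'bytes_field': b'abc'})
    except TypeError as e:
        print(e)  # "... is not JSON serializable"

    # Python 2: b'abc' is a str, so it slips into the JSON payload, and
    # BQ then base64-*decodes* it on ingestion. The table ends up holding
    # b'i\xb7', i.e. base64.b64decode('abc='):
    print(base64.b64decode('abc='))  # b'i\xb7'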
>>>>>>> There is also another issue when writing base-64 encoded strings
>>>>>>> to BQ. When no schema is specified, this results in "Invalid
>>>>>>> schema update. Field bytes has changed type from BYTES to STRING".
>>>>>>>
>>>>>>> This error can be reproduced by uploading a file (directly in the
>>>>>>> BQ UI) to a table with bytes while using schema autodetect.
>>>>>>>
>>>>>>> Suggested solution:
>>>>>>>
>>>>>>> I suggest changing BigQuery IO to handle the base-64 encoding as
>>>>>>> follows, to allow the user to read and write raw bytes in BQ.
>>>>>>>
>>>>>>> Writing data:
>>>>>>>
>>>>>>> - When a new table is created, we use the provided schema to
>>>>>>> detect bytes and handle the base-64 encoding accordingly.
>>>>>>> - When data is written to an existing table, we use the API to
>>>>>>> get the schema of the table and handle the base-64 encoding
>>>>>>> accordingly. We also pass the schema as an argument to avoid the
>>>>>>> error from schema autodetect.
>>>>>>>
>>>>>>> Reading data:
>>>>>>>
>>>>>>> - When reading data, we also request the schema and handle the
>>>>>>> base-64 decoding accordingly, to return raw bytes.
>>>>>>>
>>>>>>> What are your thoughts on this?
>>>>>>
>>>>>> Thanks for the update. More context here:
>>>>>> https://issues.apache.org/jira/browse/BEAM-6769
>>>>>>
>>>>>> The suggested solution sounds good to me. BTW, do you know how the
>>>>>> Java SDK handles the bytes type? I believe we write JSON files and
>>>>>> execute load jobs there as well (when the method is FILE_LOADS).
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>>> *I could not find this in the documentation of the API, or in the
>>>>>>> documentation of BigQuery itself, which also expects base-64
>>>>>>> encoded values. I discovered this when uploading a file in the BQ
>>>>>>> UI and getting the error: "Could not decode base64 string to
>>>>>>> bytes."
>>>>>>>
>>>>>>> --
>>>>>>> Juta Staes
>>>>>>> ML6 Gent
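
(Finally, a rough sketch of the schema lookup from the suggested solution
above and from point 3 of Valentyn's summary: fetch the table schema once,
find the BYTES columns, then apply the base-64 handling per column. The
google-cloud-bigquery client is used here purely for illustration; Beam
would go through its own BigQuery wrapper, so treat the helper names as
assumptions:)

    import base64
    from google.cloud import bigquery

    def bytes_columns(table_ref):
        # Illustrative helper, not Beam's actual implementation. One
        # extra API call to read the table schema; returns the names of
        # the columns declared as BYTES.
        client = bigquery.Client()
        table = client.get_table(table_ref)
        return {f.name for f in table.schema if f.field_type == 'BYTES'}

    def encode_row(row, byte_cols):
        # Write path: base64-encode only the columns the schema marks
        # as BYTES.
        return {k: base64.b64encode(v).decode('ascii') if k in byte_cols
                else v for k, v in row.items()}

    def decode_row(row, byte_cols):
        # Read path: undo the encoding so users receive raw bytes again.
        return {k: base64.b64decode(v) if k in byte_cols else v
                for k, v in row.items()}

Passing the fetched schema along with the write would then also sidestep
the BYTES-vs-STRING autodetect error described above.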
