Pablo, according to Juta's analysis (1.c in the document) and also https://issuetracker.google.com/issues/129006689, I think BQ confuses BYTES and STRING when schema is not specified... This seems to me like a BQ bug, so for Beam this means that we either have to wait until BQ fixes or, or work around it. If we work around it, we can ask users to always supply schema if their table has BYTES data (temporary limitation), or try to pull schema from BQ before (every?) write operation.
Cham, according to BQ documentation, BQ *can* auto-detect schema when populating new tables using a data source, for example a json file with records : https://cloud.google.com/bigquery/docs/schema-detect. On Wed, Mar 20, 2019 at 7:15 PM Chamikara Jayalath <[email protected]> wrote: > > > On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada <[email protected]> wrote: > >> That sounds reasonable to me, Valentyn. >> >> Regarding (3), when the table already exists, it's not necessary to get >> the schema. BQ is smart enough to load everything in appropriately. (as >> long as bytes fields are base64 encoded) >> >> The problem is when the table does not exist and the user does not >> provide a schema. In that case, there is no simple way of auto-inferring >> the schema, as you correctly point out. I think it's reasonable to simply >> expect users provide schemas if their data will have tricky types to infer. >> Best >> -P. >> > > Is this even an option ? I think when table is not available users have to > provide a schema to create a new table. > > >> >> >> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev <[email protected]> >> wrote: >> >>> Thanks Juta for detailed analysis. >>> >>> I reached out to BigQuery team to improve documentation around treatment >>> of Bytes and reported the issue that schema autodetection does not work >>> <https://issuetracker.google.com/issues/129006689> for BYTES in GCP >>> issue tracker <https://cloud.google.com/support/docs/issue-trackers>. >>> >>> Is this a correct summary of your proposal? >>> >>> 1. Beam will base64-encode raw bytes, before passing them to BQ over >>> rest API. This will be a change in behavior for Python 2 (for good reasons). >>> 2. When reading data from BQ, all fileds of type BYTES will be >>> base64-decoded. >>> 3. Beam will send an API call to BigQuery to get table schema, whenever >>> schema is not supplied, to work around >>> https://issuetracker.google.com/issues/129006689. Does anyone see any >>> concerns with this? Is it always possible? >>> >>> Thanks, >>> Valentyn >>> >>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax <[email protected]> wrote: >>> >>>> The Java SDK relies on Jackson to do the encoding. >>>> >>>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath < >>>> [email protected]> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <[email protected]> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> >>>>>> I am working on porting beam to python 3 and discovered the following: >>>>>> >>>>>> >>>>>> Current handling of bytes in bigquery IO: >>>>>> >>>>>> When writing bytes to BQ , beam uses >>>>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API >>>>>> expects byte values to be base-64 encoded*. >>>>>> >>>>>> However when writing raw bytes they are currently never transformed >>>>>> to base-64 encoded strings. This results in the following errors: >>>>>> >>>>>> - >>>>>> >>>>>> When writing b’abc’ in python 2 this results in actually writing >>>>>> b'i\xb7' which is the same as base64.b64decode('abc=')) >>>>>> - >>>>>> >>>>>> When writing b’abc’ in python 3 this results in “TypeError: >>>>>> b'abc' is not JSON serializable” >>>>>> - >>>>>> >>>>>> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' >>>>>> codec can't decode byte 0xab in position 0: invalid start byte. NAN, >>>>>> INF >>>>>> and -INF values are not JSON compliant” >>>>>> - >>>>>> >>>>>> When reading bytes from BQ they are currently returned as base-64 >>>>>> encoded strings rather then the raw bytes. >>>>>> >>>>>> >>>>>> Example code: >>>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing >>>>>> >>>>>> There is also another issue when writing base-64 encoded string to >>>>>> BQ. When no schema is specified this results in “Invalid schema update. >>>>>> Field bytes has changed type from BYTES to STRING”. >>>>>> >>>>>> This error can be reproduced when uploading a file (directly in the >>>>>> BQ UI) to a table with bytes and using schema autodetect. >>>>>> >>>>>> Suggested solution: >>>>>> >>>>>> I suggest to change BigQuery IO to handle the base-64 encoding as >>>>>> follows to allow the user to read and write raw bytes in BQ >>>>>> >>>>>> Writing data: >>>>>> >>>>>> - >>>>>> >>>>>> When a new table is created we use the provided schema to detect >>>>>> bytes and handle the base-64 encoding accordingly >>>>>> - >>>>>> >>>>>> When data is written to an existing table we use the API to get >>>>>> the schema of the table and handle the base-64 encoding accordingly. >>>>>> We >>>>>> also pass the schema as argument to avoid the error from schema >>>>>> autodetect. >>>>>> >>>>>> Reading data: >>>>>> >>>>>> - >>>>>> >>>>>> When reading data we also request the schema and handle the >>>>>> base-64 decoding accordingly to return raw bytes >>>>>> >>>>>> >>>>>> What are your thoughts on this? >>>>>> >>>>> >>>>> Thanks for the update. More context here: >>>>> https://issues.apache.org/jira/browse/BEAM-6769 >>>>> >>>>> Suggested solution sounds good to me. BTW do you know how Java SDK >>>>> handles bytes type ? I believe we write JSON files and execute load jobs >>>>> there as well (when method is FILE_LOADS). >>>>> >>>>> Thanks, >>>>> Cham >>>>> >>>>> >>>>>> >>>>>> *I could not find this in the documentation of the API or in the >>>>>> documentation of BigQuery itself which also expects base-64 encoded >>>>>> values. >>>>>> I discovered this when uploading a file to BQ UI and getting an error: >>>>>> "Could not decode base64 string to bytes." >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> [image: https://ml6.eu] <https://ml6.eu/> >>>>>> >>>>>> * Juta Staes* >>>>>> ML6 Gent >>>>>> <https://www.google.be/maps/place/ML6/@51.037408,3.7044893,17z/data=!3m1!4b1!4m5!3m4!1s0x47c37161feeca14b:0xb8f72585fdd21c90!8m2!3d51.037408!4d3.706678?hl=nl> >>>>>> >>>>>> **** DISCLAIMER **** >>>>>> This email and any files transmitted with it are confidential and >>>>>> intended solely for the use of the individual or entity to whom they are >>>>>> addressed. If you have received this email in error please notify the >>>>>> system manager. This message contains confidential information and is >>>>>> intended only for the individual named. If you are not the named >>>>>> addressee >>>>>> you should not disseminate, distribute or copy this e-mail. Please notify >>>>>> the sender immediately by e-mail if you have received this e-mail by >>>>>> mistake and delete this e-mail from your system. If you are not the >>>>>> intended recipient you are notified that disclosing, copying, >>>>>> distributing >>>>>> or taking any action in reliance on the contents of this information is >>>>>> strictly prohibited. >>>>>> >>>>>
