Thanks Juta for detailed analysis. I reached out to BigQuery team to improve documentation around treatment of Bytes and reported the issue that schema autodetection does not work <https://issuetracker.google.com/issues/129006689> for BYTES in GCP issue tracker <https://cloud.google.com/support/docs/issue-trackers>.
Is this a correct summary of your proposal? 1. Beam will base64-encode raw bytes, before passing them to BQ over rest API. This will be a change in behavior for Python 2 (for good reasons). 2. When reading data from BQ, all fileds of type BYTES will be base64-decoded. 3. Beam will send an API call to BigQuery to get table schema, whenever schema is not supplied, to work around https://issuetracker.google.com/issues/129006689. Does anyone see any concerns with this? Is it always possible? Thanks, Valentyn On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax <[email protected]> wrote: > The Java SDK relies on Jackson to do the encoding. > > On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <[email protected]> > wrote: > >> >> >> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <[email protected]> wrote: >> >>> Hi all, >>> >>> >>> I am working on porting beam to python 3 and discovered the following: >>> >>> >>> Current handling of bytes in bigquery IO: >>> >>> When writing bytes to BQ , beam uses >>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API >>> expects byte values to be base-64 encoded*. >>> >>> However when writing raw bytes they are currently never transformed to >>> base-64 encoded strings. This results in the following errors: >>> >>> - >>> >>> When writing b’abc’ in python 2 this results in actually writing >>> b'i\xb7' which is the same as base64.b64decode('abc=')) >>> - >>> >>> When writing b’abc’ in python 3 this results in “TypeError: b'abc' >>> is not JSON serializable” >>> - >>> >>> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' >>> codec can't decode byte 0xab in position 0: invalid start byte. NAN, INF >>> and -INF values are not JSON compliant” >>> - >>> >>> When reading bytes from BQ they are currently returned as base-64 >>> encoded strings rather then the raw bytes. >>> >>> >>> Example code: >>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing >>> >>> There is also another issue when writing base-64 encoded string to BQ. >>> When no schema is specified this results in “Invalid schema update. Field >>> bytes has changed type from BYTES to STRING”. >>> >>> This error can be reproduced when uploading a file (directly in the BQ >>> UI) to a table with bytes and using schema autodetect. >>> >>> Suggested solution: >>> >>> I suggest to change BigQuery IO to handle the base-64 encoding as >>> follows to allow the user to read and write raw bytes in BQ >>> >>> Writing data: >>> >>> - >>> >>> When a new table is created we use the provided schema to detect >>> bytes and handle the base-64 encoding accordingly >>> - >>> >>> When data is written to an existing table we use the API to get the >>> schema of the table and handle the base-64 encoding accordingly. We also >>> pass the schema as argument to avoid the error from schema autodetect. >>> >>> Reading data: >>> >>> - >>> >>> When reading data we also request the schema and handle the base-64 >>> decoding accordingly to return raw bytes >>> >>> >>> What are your thoughts on this? >>> >> >> Thanks for the update. More context here: >> https://issues.apache.org/jira/browse/BEAM-6769 >> >> Suggested solution sounds good to me. BTW do you know how Java SDK >> handles bytes type ? I believe we write JSON files and execute load jobs >> there as well (when method is FILE_LOADS). >> >> Thanks, >> Cham >> >> >>> >>> *I could not find this in the documentation of the API or in the >>> documentation of BigQuery itself which also expects base-64 encoded values. >>> I discovered this when uploading a file to BQ UI and getting an error: >>> "Could not decode base64 string to bytes." >>> >>> >>> -- >>> >>> [image: https://ml6.eu] <https://ml6.eu/> >>> >>> * Juta Staes* >>> ML6 Gent >>> <https://www.google.be/maps/place/ML6/@51.037408,3.7044893,17z/data=!3m1!4b1!4m5!3m4!1s0x47c37161feeca14b:0xb8f72585fdd21c90!8m2!3d51.037408!4d3.706678?hl=nl> >>> >>> **** DISCLAIMER **** >>> This email and any files transmitted with it are confidential and >>> intended solely for the use of the individual or entity to whom they are >>> addressed. If you have received this email in error please notify the >>> system manager. This message contains confidential information and is >>> intended only for the individual named. If you are not the named addressee >>> you should not disseminate, distribute or copy this e-mail. Please notify >>> the sender immediately by e-mail if you have received this e-mail by >>> mistake and delete this e-mail from your system. If you are not the >>> intended recipient you are notified that disclosing, copying, distributing >>> or taking any action in reliance on the contents of this information is >>> strictly prohibited. >>> >>
