Hi all,
I am working on porting Beam to Python 3 and discovered the following issue with the current handling of bytes in BigQuery IO.

When writing bytes to BigQuery, Beam uses https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API expects byte values to be base64-encoded*. However, raw bytes are currently never transformed into base64-encoded strings before they are written. This results in the following errors:

- When writing b'abc' in Python 2, the value actually written is b'i\xb7', which is the same as base64.b64decode('abc=').
- When writing b'abc' in Python 3, this fails with "TypeError: b'abc' is not JSON serializable".
- When writing b'\xab' in Python 2/3, this fails with "ValueError: 'utf8' codec can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF values are not JSON compliant".
- When reading bytes from BigQuery, they are currently returned as base64-encoded strings rather than as raw bytes.

Example code: https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing

There is also another issue when writing base64-encoded strings to BigQuery: when no schema is specified, this results in "Invalid schema update. Field bytes has changed type from BYTES to STRING". This error can be reproduced by uploading a file (directly in the BigQuery UI) to a table with a bytes column while using schema autodetect.

Suggested solution: I suggest changing BigQuery IO to handle the base64 encoding itself, so that users can read and write raw bytes (a sketch of what I mean follows below).

Writing data:
- When a new table is created, we use the provided schema to detect BYTES fields and apply the base64 encoding accordingly.
- When data is written to an existing table, we use the API to get the schema of the table and apply the base64 encoding accordingly. We also pass the schema as an argument to avoid the error from schema autodetect.

Reading data:
- When reading data, we also request the schema and apply the base64 decoding accordingly, so that raw bytes are returned.
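To make this more concrete, here is a minimal sketch of the schema-driven encoding/decoding I have in mind. It is only an illustration, not the actual BigQuery IO code: the helper names encode_row/decode_row, the plain-dict rows, and the list-of-dicts schema are assumptions I made for the example.

import base64

def encode_row(row, schema_fields):
    # The REST API expects BYTES values as base64 strings, so encode them
    # before the row is JSON-serialized.
    encoded = dict(row)
    for field in schema_fields:
        value = encoded.get(field['name'])
        if field['type'] == 'BYTES' and value is not None:
            encoded[field['name']] = base64.b64encode(value).decode('ascii')
    return encoded

def decode_row(row, schema_fields):
    # Decode base64 strings back to raw bytes when reading, so the user
    # never sees the base64 representation.
    decoded = dict(row)
    for field in schema_fields:
        value = decoded.get(field['name'])
        if field['type'] == 'BYTES' and value is not None:
            decoded[field['name']] = base64.b64decode(value)
    return decoded

# Example, with the schema given as a list of field dicts:
schema = [{'name': 'key', 'type': 'STRING'}, {'name': 'payload', 'type': 'BYTES'}]
row = {'key': 'k1', 'payload': b'\xab'}
wire_row = encode_row(row, schema)  # {'key': 'k1', 'payload': 'qw=='}
assert decode_row(wire_row, schema)['payload'] == b'\xab'

In the real implementation this logic would of course live in the existing row conversion of BigQuery IO rather than in standalone helpers, and it would also have to handle REPEATED fields and nested RECORD fields.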
What are your thoughts on this?

*I could not find this requirement in the documentation of the API, nor in the documentation of BigQuery itself, which also expects base64-encoded values. I discovered it when uploading a file in the BigQuery UI and getting the error: "Could not decode base64 string to bytes."

--
Juta Staes
ML6 Gent (https://ml6.eu)