Hi all,

I am working on porting Beam to Python 3 and discovered the following:


Current handling of bytes in BigQuery IO:

When writing bytes to BQ, Beam uses
https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API expects
byte values to be base64-encoded*.

However, raw bytes are currently never transformed into base64-encoded
strings before writing. This results in the following errors:

   - When writing b'abc' in Python 2, this actually writes b'i\xb7', which
     is the same as base64.b64decode('abc=').

   - When writing b'abc' in Python 3, this results in "TypeError: b'abc' is
     not JSON serializable".

   - When writing b'\xab' in Python 2/3, this gives "ValueError: 'utf8'
     codec can't decode byte 0xab in position 0: invalid start byte. NAN,
     INF and -INF values are not JSON compliant".

   - When reading bytes from BQ, they are currently returned as
     base64-encoded strings rather than the raw bytes.
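The first two failures can be reproduced directly with the standard library
(a minimal sketch, independent of Beam):

```python
import base64
import json

# In Python 2, b'abc' is silently treated as a base64 string; decoding
# 'abc=' shows which raw bytes actually end up in the table.
decoded = base64.b64decode('abc=')
print(decoded)  # b'i\xb7'

# In Python 3, raw bytes cannot be JSON-serialized at all.
err = None
try:
    json.dumps({'field': b'abc'})
except TypeError as e:
    err = e
print(err)
```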


Example code:
https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing

There is also another issue when writing base64-encoded strings to BQ: when
no schema is specified, this results in "Invalid schema update. Field bytes
has changed type from BYTES to STRING".

This error can be reproduced by uploading a file (directly in the BQ UI) to
a table with a BYTES column while using schema autodetect.

Suggested solution:

I suggest changing BigQuery IO to handle the base64 encoding as follows, so
that users can read and write raw bytes in BQ:

Writing data:

   - When a new table is created, we use the provided schema to detect
     BYTES fields and handle the base64 encoding accordingly.

   - When data is written to an existing table, we use the API to get the
     schema of the table and handle the base64 encoding accordingly. We
     also pass the schema as an argument to avoid the schema-autodetect
     error.
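As a rough sketch of the write path (the helper name and schema
representation below are illustrative, not Beam's actual API):

```python
import base64

def encode_bytes_fields(row, schema_fields):
    # Illustrative helper: base64-encode values of BYTES fields before
    # handing the row to the BigQuery insert API. `schema_fields` maps
    # field name -> BigQuery type.
    encoded = dict(row)
    for name, bq_type in schema_fields.items():
        if bq_type == 'BYTES' and isinstance(encoded.get(name), bytes):
            encoded[name] = base64.b64encode(encoded[name]).decode('ascii')
    return encoded

row = {'name': 'example', 'payload': b'\xab'}
print(encode_bytes_fields(row, {'name': 'STRING', 'payload': 'BYTES'}))
# {'name': 'example', 'payload': 'qw=='}
```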

Reading data:

   - When reading data, we also request the schema and handle the base64
     decoding accordingly, returning raw bytes.
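And the corresponding read path could look roughly like this (again an
illustrative sketch, not Beam's actual implementation):

```python
import base64

def decode_bytes_fields(row, schema_fields):
    # Illustrative counterpart for the read path: turn the base64 strings
    # returned by the API back into raw bytes for BYTES columns.
    decoded = dict(row)
    for name, bq_type in schema_fields.items():
        if bq_type == 'BYTES' and isinstance(decoded.get(name), str):
            decoded[name] = base64.b64decode(decoded[name])
    return decoded

print(decode_bytes_fields({'payload': 'qw=='}, {'payload': 'BYTES'}))
# {'payload': b'\xab'}
```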


What are your thoughts on this?

*I could not find this in the documentation of the API, nor in the
documentation of BigQuery itself, which also expects base64-encoded values.
I discovered it when uploading a file via the BQ UI and getting the error:
"Could not decode base64 string to bytes."


-- 


Juta Staes
ML6 Gent

