The Java SDK relies on Jackson to do the encoding.

On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <[email protected]> wrote:
> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <[email protected]> wrote:
>
>> Hi all,
>>
>> I am working on porting Beam to Python 3 and discovered the following.
>>
>> Current handling of bytes in BigQuery IO:
>>
>> When writing bytes to BQ, Beam uses
>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>> expects byte values to be base64-encoded*.
>>
>> However, raw bytes are currently never transformed into base64-encoded
>> strings before writing. This results in the following errors:
>>
>> - When writing b'abc' in Python 2, what is actually written is b'i\xb7',
>>   which is the same as base64.b64decode('abc=').
>> - When writing b'abc' in Python 3, this raises "TypeError: b'abc' is
>>   not JSON serializable".
>> - When writing b'\xab' in Python 2/3, this raises "ValueError: 'utf8'
>>   codec can't decode byte 0xab in position 0: invalid start byte. NAN,
>>   INF and -INF values are not JSON compliant".
>> - When reading bytes from BQ, they are currently returned as
>>   base64-encoded strings rather than the raw bytes.
>>
>> Example code:
>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>
>> There is also another issue when writing base64-encoded strings to BQ.
>> When no schema is specified, this results in "Invalid schema update.
>> Field bytes has changed type from BYTES to STRING".
>>
>> This error can be reproduced by uploading a file (directly in the BQ
>> UI) to a table with bytes while using schema autodetect.
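The errors above follow directly from the API contract: BYTES values must travel as base64 strings inside the JSON payload. A minimal sketch of the round trip (the field name `bytes_field` is just an illustration, not anything from the Beam code):

```python
import base64
import json

# The BigQuery REST API expects BYTES values as base64-encoded strings
# inside the JSON payload, so raw bytes must be encoded before writing.
raw = b'abc'
encoded = base64.b64encode(raw).decode('ascii')  # 'YWJj'
row = json.dumps({'bytes_field': encoded})       # now JSON-serializable

# Passing the raw text 'abc' as if it were already base64 explains the
# Python 2 corruption described above: BigQuery decodes it to other bytes.
assert base64.b64decode('abc=') == b'i\xb7'

# Round trip: values read back from BigQuery arrive as base64 strings
# and must be decoded to recover the original raw bytes.
assert base64.b64decode(encoded) == raw
```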
>> Suggested solution:
>>
>> I suggest changing BigQuery IO to handle the base64 encoding as
>> follows, to allow the user to read and write raw bytes in BQ.
>>
>> Writing data:
>>
>> - When a new table is created, we use the provided schema to detect
>>   bytes fields and handle the base64 encoding accordingly.
>> - When data is written to an existing table, we use the API to get the
>>   schema of the table and handle the base64 encoding accordingly. We
>>   also pass the schema as an argument to avoid the error from schema
>>   autodetect.
>>
>> Reading data:
>>
>> - When reading data, we also request the schema and handle the base64
>>   decoding accordingly, so that raw bytes are returned.
>>
>> What are your thoughts on this?
>
> Thanks for the update. More context here:
> https://issues.apache.org/jira/browse/BEAM-6769
>
> Suggested solution sounds good to me. BTW do you know how the Java SDK
> handles the bytes type? I believe we write JSON files and execute load
> jobs there as well (when method is FILE_LOADS).
>
> Thanks,
> Cham
>
>> *I could not find this in the documentation of the API or in the
>> documentation of BigQuery itself, which also expects base64-encoded
>> values. I discovered this when uploading a file in the BQ UI and
>> getting the error: "Could not decode base64 string to bytes."
>>
>> --
>> Juta Staes
>> ML6 Gent
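The schema-driven approach proposed above can be sketched roughly as follows. This is only an illustration of the idea, not the actual Beam implementation: `bytes_field_names`, `encode_row`, and `decode_row` are hypothetical helpers, the schema is assumed in its dict form, and only top-level (non-nested, non-repeated) fields are handled.

```python
import base64

def bytes_field_names(schema):
    # Names of top-level fields declared BYTES in a BigQuery table
    # schema, assumed in dict form: {'fields': [{'name': ..., 'type': ...}]}.
    return {f['name'] for f in schema['fields'] if f['type'] == 'BYTES'}

def encode_row(row, schema):
    # Before writing: base64-encode raw bytes so the row is JSON-safe.
    to_encode = bytes_field_names(schema)
    return {k: base64.b64encode(v).decode('ascii') if k in to_encode else v
            for k, v in row.items()}

def decode_row(row, schema):
    # After reading: decode base64 strings back to raw bytes.
    to_decode = bytes_field_names(schema)
    return {k: base64.b64decode(v) if k in to_decode else v
            for k, v in row.items()}

schema = {'fields': [{'name': 'label', 'type': 'STRING'},
                     {'name': 'payload', 'type': 'BYTES'}]}
row = {'label': 'x', 'payload': b'\xab'}
# Round trip preserves the raw bytes, including non-UTF-8 values.
assert decode_row(encode_row(row, schema), schema) == row
```

Passing the fetched schema along with the write (as suggested) also sidesteps the schema-autodetect error, since BigQuery then knows the column is BYTES rather than guessing STRING.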
