The Java SDK relies on Jackson to do the encoding.

On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <[email protected]> wrote:
> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <[email protected]> wrote:
>
>> Hi all,
>>
>> I am working on porting Beam to Python 3 and discovered the following.
>>
>> Current handling of bytes in BigQuery IO:
>>
>> When writing bytes to BQ, Beam uses
>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>> expects byte values to be base64-encoded*.
>>
>> However, raw bytes are currently never transformed into base64-encoded
>> strings before writing. This results in the following errors:
>>
>> - When writing b'abc' in Python 2, what is actually written is b'i\xb7',
>>   which is the same as base64.b64decode('abc=').
>> - When writing b'abc' in Python 3, this raises "TypeError: b'abc' is
>>   not JSON serializable".
>> - When writing b'\xab' in Python 2/3, this raises "ValueError: 'utf8'
>>   codec can't decode byte 0xab in position 0: invalid start byte. NAN,
>>   INF and -INF values are not JSON compliant".
>> - When reading bytes from BQ, they are currently returned as
>>   base64-encoded strings rather than the raw bytes.
>>
>> Example code:
>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>
>> There is also another issue when writing base64-encoded strings to BQ.
>> When no schema is specified, this results in "Invalid schema update.
>> Field bytes has changed type from BYTES to STRING".
>>
>> This error can be reproduced by uploading a file (directly in the BQ
>> UI) to a table with bytes while using schema autodetect.
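The errors above follow directly from the API contract: BYTES values must travel as base64 strings inside the JSON payload. A minimal sketch of the round trip (the field name `bytes_field` is just an illustration, not anything from the Beam code):

```python
import base64
import json

# The BigQuery REST API expects BYTES values as base64-encoded strings
# inside the JSON payload, so raw bytes must be encoded before writing.
raw = b'abc'
encoded = base64.b64encode(raw).decode('ascii')  # 'YWJj'
row = json.dumps({'bytes_field': encoded})       # now JSON-serializable

# Passing the raw text 'abc' as if it were already base64 explains the
# Python 2 corruption described above: BigQuery decodes it to other bytes.
assert base64.b64decode('abc=') == b'i\xb7'

# Round trip: values read back from BigQuery arrive as base64 strings
# and must be decoded to recover the original raw bytes.
assert base64.b64decode(encoded) == raw
```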
>> Suggested solution:
>>
>> I suggest changing BigQuery IO to handle the base64 encoding as
>> follows, to allow the user to read and write raw bytes in BQ.
>>
>> Writing data:
>>
>> - When a new table is created, we use the provided schema to detect
>>   bytes fields and handle the base64 encoding accordingly.
>> - When data is written to an existing table, we use the API to get the
>>   schema of the table and handle the base64 encoding accordingly. We
>>   also pass the schema as an argument to avoid the error from schema
>>   autodetect.
>>
>> Reading data:
>>
>> - When reading data, we also request the schema and handle the base64
>>   decoding accordingly, so that raw bytes are returned.
>>
>> What are your thoughts on this?
>
> Thanks for the update. More context here:
> https://issues.apache.org/jira/browse/BEAM-6769
>
> Suggested solution sounds good to me. BTW do you know how the Java SDK
> handles the bytes type? I believe we write JSON files and execute load
> jobs there as well (when method is FILE_LOADS).
>
> Thanks,
> Cham
>
>> *I could not find this in the documentation of the API or in the
>> documentation of BigQuery itself, which also expects base64-encoded
>> values. I discovered this when uploading a file in the BQ UI and
>> getting the error: "Could not decode base64 string to bytes."
>>
>> --
>> Juta Staes
>> ML6 Gent
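The schema-driven approach proposed above can be sketched roughly as follows. This is only an illustration of the idea, not the actual Beam implementation: `bytes_field_names`, `encode_row`, and `decode_row` are hypothetical helpers, the schema is assumed in its dict form, and only top-level (non-nested, non-repeated) fields are handled.

```python
import base64

def bytes_field_names(schema):
    # Names of top-level fields declared BYTES in a BigQuery table
    # schema, assumed in dict form: {'fields': [{'name': ..., 'type': ...}]}.
    return {f['name'] for f in schema['fields'] if f['type'] == 'BYTES'}

def encode_row(row, schema):
    # Before writing: base64-encode raw bytes so the row is JSON-safe.
    to_encode = bytes_field_names(schema)
    return {k: base64.b64encode(v).decode('ascii') if k in to_encode else v
            for k, v in row.items()}

def decode_row(row, schema):
    # After reading: decode base64 strings back to raw bytes.
    to_decode = bytes_field_names(schema)
    return {k: base64.b64decode(v) if k in to_decode else v
            for k, v in row.items()}

schema = {'fields': [{'name': 'label', 'type': 'STRING'},
                     {'name': 'payload', 'type': 'BYTES'}]}
row = {'label': 'x', 'payload': b'\xab'}
# Round trip preserves the raw bytes, including non-UTF-8 values.
assert decode_row(encode_row(row, schema), schema) == row
```

Passing the fetched schema along with the write (as suggested) also sidesteps the schema-autodetect error, since BigQuery then knows the column is BYTES rather than guessing STRING.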
