On Wed, Mar 20, 2019 at 7:37 PM Valentyn Tymofieiev <[email protected]> wrote:
> Pablo, according to Juta's analysis (1.c in the document) and also
> https://issuetracker.google.com/issues/129006689, I think BQ confuses
> BYTES and STRING when the schema is not specified... This seems to me
> like a BQ bug, so for Beam this means that we either have to wait until
> BQ fixes it, or work around it. If we work around it, we can ask users
> to always supply a schema if their table has BYTES data (temporary
> limitation), or try to pull the schema from BQ before (every?) write
> operation.
>
> Cham, according to BQ documentation, BQ *can* auto-detect the schema
> when populating new tables from a data source, for example a JSON file
> with records: https://cloud.google.com/bigquery/docs/schema-detect.

Ah, we don't support that AFAIK, so currently we require users to provide
a schema to create tables. But good point, in case we ever want to support
that feature.

> On Wed, Mar 20, 2019 at 7:15 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>> On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada <[email protected]> wrote:
>>
>>> That sounds reasonable to me, Valentyn.
>>>
>>> Regarding (3), when the table already exists, it's not necessary to
>>> get the schema. BQ is smart enough to load everything appropriately
>>> (as long as bytes fields are base64-encoded).
>>>
>>> The problem is when the table does not exist and the user does not
>>> provide a schema. In that case, there is no simple way of
>>> auto-inferring the schema, as you correctly point out. I think it's
>>> reasonable to simply expect users to provide schemas if their data
>>> has types that are tricky to infer.
>>> Best
>>> -P.
>>
>> Is this even an option? I think when the table is not available, users
>> have to provide a schema to create a new table.
>>
>>> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev
>>> <[email protected]> wrote:
>>>
>>>> Thanks, Juta, for the detailed analysis.
>>>>
>>>> I reached out to the BigQuery team to improve documentation around
>>>> the treatment of BYTES and reported the issue that schema
>>>> autodetection does not work
>>>> <https://issuetracker.google.com/issues/129006689> for BYTES in the
>>>> GCP issue tracker
>>>> <https://cloud.google.com/support/docs/issue-trackers>.
>>>>
>>>> Is this a correct summary of your proposal?
>>>>
>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over
>>>> the REST API. This will be a change in behavior for Python 2 (for
>>>> good reasons).
>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>> base64-decoded.
>>>> 3. Beam will send an API call to BigQuery to get the table schema
>>>> whenever a schema is not supplied, to work around
>>>> https://issuetracker.google.com/issues/129006689. Does anyone see
>>>> any concerns with this? Is it always possible?
>>>>
>>>> Thanks,
>>>> Valentyn
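
(For concreteness, a minimal sketch of what points 1 and 2 could look
like on the Python side; the helper names are illustrative, not actual
Beam internals:)

    import base64

    # Illustrative sketch, not the actual Beam implementation.

    def encode_bytes_for_bq(value):
        # Point 1 (write path): base64-encode raw bytes before they are
        # placed into the JSON payload of the BQ REST request.
        if isinstance(value, bytes):
            return base64.b64encode(value).decode('ascii')
        return value

    def decode_bytes_from_bq(value):
        # Point 2 (read path): fields whose schema type is BYTES come
        # back as base-64 strings; decode them so users get raw bytes.
        return base64.b64decode(value)

    # Round trip: arbitrary bytes survive the write/read cycle.
    assert decode_bytes_from_bq(encode_bytes_for_bq(b'\xab')) == b'\xab'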
>>>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> The Java SDK relies on Jackson to do the encoding.
>>>>>
>>>>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am working on porting Beam to Python 3 and discovered the
>>>>>>> following:
>>>>>>>
>>>>>>> Current handling of bytes in BigQuery IO:
>>>>>>>
>>>>>>> When writing bytes to BQ, Beam uses
>>>>>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This
>>>>>>> API expects byte values to be base-64 encoded*.
>>>>>>>
>>>>>>> However, when writing raw bytes they are currently never
>>>>>>> transformed into base-64 encoded strings. This results in the
>>>>>>> following errors:
>>>>>>>
>>>>>>> - When writing b'abc' in Python 2, this actually writes b'i\xb7',
>>>>>>> which is the same as base64.b64decode('abc=').
>>>>>>> - When writing b'abc' in Python 3, this results in "TypeError:
>>>>>>> b'abc' is not JSON serializable".
>>>>>>> - When writing b'\xab' in py2/py3, this gives a "ValueError:
>>>>>>> 'utf8' codec can't decode byte 0xab in position 0: invalid start
>>>>>>> byte. NAN, INF and -INF values are not JSON compliant".
>>>>>>> - When reading bytes from BQ, they are currently returned as
>>>>>>> base-64 encoded strings rather than the raw bytes.
>>>>>>>
>>>>>>> Example code:
>>>>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
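
(The failures listed above are easy to reproduce outside Beam. A quick
illustration, with the Python 2 outcome noted in the comments:)

    import base64
    import json

    # Illustrative repro, outside of Beam.

    # Python 3: raw bytes cannot be placed in a JSON request body at all.
    try:
        json.dumps({'bytes_field': b'abc'})
    except TypeError as e:
        print(e)  # "... is not JSON serializable"

    # Python 2: b'abc' is a str, so it slips into the JSON payload, and
    # BQ then base64-*decodes* it on ingestion. The table ends up holding
    # b'i\xb7', i.e. base64.b64decode('abc='):
    print(base64.b64decode('abc='))  # b'i\xb7'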
>>>>>>> There is also another issue when writing base-64 encoded strings
>>>>>>> to BQ. When no schema is specified, this results in "Invalid
>>>>>>> schema update. Field bytes has changed type from BYTES to STRING".
>>>>>>>
>>>>>>> This error can be reproduced by uploading a file (directly in the
>>>>>>> BQ UI) to a table with bytes while using schema autodetect.
>>>>>>>
>>>>>>> Suggested solution:
>>>>>>>
>>>>>>> I suggest changing BigQuery IO to handle the base-64 encoding as
>>>>>>> follows, to allow the user to read and write raw bytes in BQ.
>>>>>>>
>>>>>>> Writing data:
>>>>>>>
>>>>>>> - When a new table is created, we use the provided schema to
>>>>>>> detect bytes and handle the base-64 encoding accordingly.
>>>>>>> - When data is written to an existing table, we use the API to
>>>>>>> get the schema of the table and handle the base-64 encoding
>>>>>>> accordingly. We also pass the schema as an argument to avoid the
>>>>>>> error from schema autodetect.
>>>>>>>
>>>>>>> Reading data:
>>>>>>>
>>>>>>> - When reading data, we also request the schema and handle the
>>>>>>> base-64 decoding accordingly, to return raw bytes.
>>>>>>>
>>>>>>> What are your thoughts on this?
>>>>>>
>>>>>> Thanks for the update. More context here:
>>>>>> https://issues.apache.org/jira/browse/BEAM-6769
>>>>>>
>>>>>> The suggested solution sounds good to me. BTW, do you know how the
>>>>>> Java SDK handles the bytes type? I believe we write JSON files and
>>>>>> execute load jobs there as well (when the method is FILE_LOADS).
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>>> *I could not find this in the documentation of the API, or in the
>>>>>>> documentation of BigQuery itself, which also expects base-64
>>>>>>> encoded values. I discovered this when uploading a file in the BQ
>>>>>>> UI and getting the error: "Could not decode base64 string to
>>>>>>> bytes."
>>>>>>>
>>>>>>> --
>>>>>>> Juta Staes
>>>>>>> ML6 Gent
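
(Finally, a rough sketch of the schema lookup from the suggested solution
above and from point 3 of Valentyn's summary: fetch the table schema once,
find the BYTES columns, then apply the base-64 handling per column. The
google-cloud-bigquery client is used here purely for illustration; Beam
would go through its own BigQuery wrapper, so treat the helper names as
assumptions:)

    import base64
    from google.cloud import bigquery

    def bytes_columns(table_ref):
        # Illustrative helper, not Beam's actual implementation. One
        # extra API call to read the table schema; returns the names of
        # the columns declared as BYTES.
        client = bigquery.Client()
        table = client.get_table(table_ref)
        return {f.name for f in table.schema if f.field_type == 'BYTES'}

    def encode_row(row, byte_cols):
        # Write path: base64-encode only the columns the schema marks
        # as BYTES.
        return {k: base64.b64encode(v).decode('ascii') if k in byte_cols
                else v for k, v in row.items()}

    def decode_row(row, byte_cols):
        # Read path: undo the encoding so users receive raw bytes again.
        return {k: base64.b64decode(v) if k in byte_cols else v
                for k, v in row.items()}

Passing the fetched schema along with the write would then also sidestep
the BYTES-vs-STRING autodetect error described above.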
