On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada <pabl...@google.com> wrote:
> That sounds reasonable to me, Valentyn.
>
> Regarding (3), when the table already exists, it's not necessary to get
> the schema. BQ is smart enough to load everything in appropriately (as
> long as bytes fields are base64-encoded).
>
> The problem is when the table does not exist and the user does not
> provide a schema. In that case, there is no simple way of auto-inferring
> the schema, as you correctly point out. I think it's reasonable to simply
> expect users to provide schemas if their data will have types that are
> tricky to infer.
> Best
> -P.

Is this even an option? I think that when the table is not available, users
have to provide a schema to create a new table.

> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev <valen...@google.com>
> wrote:
>
>> Thanks, Juta, for the detailed analysis.
>>
>> I reached out to the BigQuery team to improve the documentation around
>> the treatment of BYTES, and reported the issue that schema autodetection
>> does not work <https://issuetracker.google.com/issues/129006689> for
>> BYTES in the GCP issue tracker
>> <https://cloud.google.com/support/docs/issue-trackers>.
>>
>> Is this a correct summary of your proposal?
>>
>> 1. Beam will base64-encode raw bytes before passing them to BQ over the
>> REST API. This will be a change in behavior for Python 2 (for good
>> reasons).
>> 2. When reading data from BQ, all fields of type BYTES will be
>> base64-decoded.
>> 3. Beam will send an API call to BigQuery to get the table schema
>> whenever a schema is not supplied, to work around
>> https://issuetracker.google.com/issues/129006689. Does anyone see any
>> concerns with this? Is it always possible?
>>
>> Thanks,
>> Valentyn
>>
>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>>
>>> The Java SDK relies on Jackson to do the encoding.
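[A minimal sketch of points (1) and (2) in Valentyn's summary: base64-encode
BYTES values before JSON-serializing rows for the REST API, and decode them
on read. The helper names below are illustrative only, not Beam's actual
internals.]

```python
import base64
import json

def encode_bytes_for_bq(value):
    # Base64-encode raw bytes so the row dict is JSON-serializable
    # for the BQ REST API, which expects BYTES as base64 text.
    return base64.b64encode(value).decode('ascii')

def decode_bytes_from_bq(value):
    # Reverse of the above: turn a base64 string from the REST API
    # back into the raw bytes the user originally wrote.
    return base64.b64decode(value)

row = {'data': encode_bytes_for_bq(b'\xab\xcd')}
payload = json.dumps(row)  # succeeds: the value is now a plain str
assert decode_bytes_from_bq(json.loads(payload)['data']) == b'\xab\xcd'
```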
>>>
>>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
>>>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <juta.st...@ml6.eu> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am working on porting Beam to Python 3 and discovered the following:
>>>>>
>>>>> Current handling of bytes in BigQuery IO:
>>>>>
>>>>> When writing bytes to BQ, Beam uses
>>>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>>>>> expects byte values to be base64-encoded*.
>>>>>
>>>>> However, raw bytes are currently never transformed to base64-encoded
>>>>> strings before writing. This results in the following errors:
>>>>>
>>>>> - Writing b'abc' in Python 2 actually writes b'i\xb7', which is the
>>>>> same as base64.b64decode('abc=').
>>>>> - Writing b'abc' in Python 3 results in "TypeError: b'abc' is not
>>>>> JSON serializable".
>>>>> - Writing b'\xab' in Python 2/3 gives "ValueError: 'utf8' codec can't
>>>>> decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>>>>> values are not JSON compliant".
>>>>> - When reading bytes from BQ, they are currently returned as
>>>>> base64-encoded strings rather than the raw bytes.
>>>>>
>>>>> Example code:
>>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>>>>
>>>>> There is also another issue when writing base64-encoded strings to
>>>>> BQ. When no schema is specified, this results in "Invalid schema
>>>>> update. Field bytes has changed type from BYTES to STRING".
>>>>>
>>>>> This error can be reproduced by uploading a file (directly in the BQ
>>>>> UI) to a table with bytes and using schema autodetect.
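[The failure modes Juta lists can be reproduced with the standard library
alone. In particular, the Python 2 corruption of b'abc' into b'i\xb7' is
exactly a base64 decode of 'abc=' — i.e. the raw bytes were treated as
base64 text. A small demonstration, independent of Beam:]

```python
import base64
import json

# Python 3: raw bytes in a row dict cannot be JSON-serialized at all.
try:
    json.dumps({'field': b'abc'})
except TypeError as e:
    print('Python 3 behavior:', e)

# Python 2 silently corrupted the data instead: b'abc' was effectively
# interpreted as base64 text, so what reached BigQuery was its decode.
assert base64.b64decode('abc=') == b'i\xb7'

# In both cases, base64-encoding the bytes first produces valid JSON
# that BigQuery decodes back to b'abc'.
encoded = base64.b64encode(b'abc').decode('ascii')
print(json.dumps({'field': encoded}))
```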
>>>>>
>>>>> Suggested solution:
>>>>>
>>>>> I suggest changing BigQuery IO to handle the base64 encoding as
>>>>> follows, to allow users to read and write raw bytes in BQ.
>>>>>
>>>>> Writing data:
>>>>>
>>>>> - When a new table is created, we use the provided schema to detect
>>>>> BYTES fields and handle the base64 encoding accordingly.
>>>>> - When data is written to an existing table, we use the API to get
>>>>> the schema of the table and handle the base64 encoding accordingly.
>>>>> We also pass the schema as an argument to avoid the error from schema
>>>>> autodetect.
>>>>>
>>>>> Reading data:
>>>>>
>>>>> - When reading data, we also request the schema and handle the base64
>>>>> decoding accordingly, so that raw bytes are returned.
>>>>>
>>>>> What are your thoughts on this?
>>>>>
>>>>
>>>> Thanks for the update. More context here:
>>>> https://issues.apache.org/jira/browse/BEAM-6769
>>>>
>>>> The suggested solution sounds good to me. BTW, do you know how the
>>>> Java SDK handles the bytes type? I believe we write JSON files and
>>>> execute load jobs there as well (when the method is FILE_LOADS).
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>> * I could not find this in the documentation of the API, or in the
>>>>> documentation of BigQuery itself, which also expects base64-encoded
>>>>> values. I discovered this when uploading a file in the BQ UI and
>>>>> getting the error: "Could not decode base64 string to bytes."
>>>>>
>>>>> --
>>>>> Juta Staes
>>>>> ML6 Gent
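[A rough sketch of Juta's suggested schema-driven handling: look up which
fields are BYTES in the table schema (whether user-provided or fetched from
the API), and base64-encode/decode only those. The schema dict below follows
the REST API's {'fields': [{'name': ..., 'type': ...}]} shape; the helper
names are hypothetical, not Beam's actual API.]

```python
import base64

def bytes_fields(schema):
    # Collect the names of all top-level BYTES fields in the schema.
    return {f['name'] for f in schema['fields'] if f['type'] == 'BYTES'}

def encode_row(row, schema):
    # Base64-encode only the BYTES fields before JSON serialization.
    keys = bytes_fields(schema)
    return {k: base64.b64encode(v).decode('ascii') if k in keys else v
            for k, v in row.items()}

def decode_row(row, schema):
    # Base64-decode only the BYTES fields when reading, returning raw bytes.
    keys = bytes_fields(schema)
    return {k: base64.b64decode(v) if k in keys else v
            for k, v in row.items()}

schema = {'fields': [{'name': 'name', 'type': 'STRING'},
                     {'name': 'data', 'type': 'BYTES'}]}
row = {'name': 'example', 'data': b'\x00\xff'}
assert decode_row(encode_row(row, schema), schema) == row
```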