On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada <pabl...@google.com> wrote:
> That sounds reasonable to me, Valentyn.
>
> Regarding (3), when the table already exists, it's not necessary to get
> the schema. BQ is smart enough to load everything in appropriately (as
> long as bytes fields are base64-encoded).
>
> The problem is when the table does not exist and the user does not
> provide a schema. In that case, there is no simple way of auto-inferring
> the schema, as you correctly point out. I think it's reasonable to simply
> expect users to provide schemas if their data will have types that are
> tricky to infer.
> Best
> -P.

Is this even an option? I think that when the table is not available, users
have to provide a schema to create a new table.

> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev <valen...@google.com>
> wrote:
>
>> Thanks, Juta, for the detailed analysis.
>>
>> I reached out to the BigQuery team to improve the documentation around
>> the treatment of BYTES, and reported the issue that schema autodetection
>> does not work <https://issuetracker.google.com/issues/129006689> for
>> BYTES in the GCP issue tracker
>> <https://cloud.google.com/support/docs/issue-trackers>.
>>
>> Is this a correct summary of your proposal?
>>
>> 1. Beam will base64-encode raw bytes before passing them to BQ over the
>> REST API. This will be a change in behavior for Python 2 (for good
>> reasons).
>> 2. When reading data from BQ, all fields of type BYTES will be
>> base64-decoded.
>> 3. Beam will send an API call to BigQuery to get the table schema
>> whenever a schema is not supplied, to work around
>> https://issuetracker.google.com/issues/129006689. Does anyone see any
>> concerns with this? Is it always possible?
>>
>> Thanks,
>> Valentyn
>>
>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>>
>>> The Java SDK relies on Jackson to do the encoding.
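[A minimal sketch of points (1) and (2) in Valentyn's summary: base64-encode
BYTES values before JSON-serializing rows for the REST API, and decode them
on read. The helper names below are illustrative only, not Beam's actual
internals.]

```python
import base64
import json

def encode_bytes_for_bq(value):
    # Base64-encode raw bytes so the row dict is JSON-serializable
    # for the BQ REST API, which expects BYTES as base64 text.
    return base64.b64encode(value).decode('ascii')

def decode_bytes_from_bq(value):
    # Reverse of the above: turn a base64 string from the REST API
    # back into the raw bytes the user originally wrote.
    return base64.b64decode(value)

row = {'data': encode_bytes_for_bq(b'\xab\xcd')}
payload = json.dumps(row)  # succeeds: the value is now a plain str
assert decode_bytes_from_bq(json.loads(payload)['data']) == b'\xab\xcd'
```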
>>>
>>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
>>>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes <juta.st...@ml6.eu> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am working on porting Beam to Python 3 and discovered the following:
>>>>>
>>>>> Current handling of bytes in BigQuery IO:
>>>>>
>>>>> When writing bytes to BQ, Beam uses
>>>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>>>>> expects byte values to be base64-encoded*.
>>>>>
>>>>> However, raw bytes are currently never transformed to base64-encoded
>>>>> strings before writing. This results in the following errors:
>>>>>
>>>>> - Writing b'abc' in Python 2 actually writes b'i\xb7', which is the
>>>>> same as base64.b64decode('abc=').
>>>>> - Writing b'abc' in Python 3 results in "TypeError: b'abc' is not
>>>>> JSON serializable".
>>>>> - Writing b'\xab' in Python 2/3 gives "ValueError: 'utf8' codec can't
>>>>> decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>>>>> values are not JSON compliant".
>>>>> - When reading bytes from BQ, they are currently returned as
>>>>> base64-encoded strings rather than the raw bytes.
>>>>>
>>>>> Example code:
>>>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>>>>
>>>>> There is also another issue when writing base64-encoded strings to
>>>>> BQ. When no schema is specified, this results in "Invalid schema
>>>>> update. Field bytes has changed type from BYTES to STRING".
>>>>>
>>>>> This error can be reproduced by uploading a file (directly in the BQ
>>>>> UI) to a table with bytes and using schema autodetect.
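[The failure modes Juta lists can be reproduced with the standard library
alone. In particular, the Python 2 corruption of b'abc' into b'i\xb7' is
exactly a base64 decode of 'abc=' — i.e. the raw bytes were treated as
base64 text. A small demonstration, independent of Beam:]

```python
import base64
import json

# Python 3: raw bytes in a row dict cannot be JSON-serialized at all.
try:
    json.dumps({'field': b'abc'})
except TypeError as e:
    print('Python 3 behavior:', e)

# Python 2 silently corrupted the data instead: b'abc' was effectively
# interpreted as base64 text, so what reached BigQuery was its decode.
assert base64.b64decode('abc=') == b'i\xb7'

# In both cases, base64-encoding the bytes first produces valid JSON
# that BigQuery decodes back to b'abc'.
encoded = base64.b64encode(b'abc').decode('ascii')
print(json.dumps({'field': encoded}))
```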
>>>>>
>>>>> Suggested solution:
>>>>>
>>>>> I suggest changing BigQuery IO to handle the base64 encoding as
>>>>> follows, to allow users to read and write raw bytes in BQ.
>>>>>
>>>>> Writing data:
>>>>>
>>>>> - When a new table is created, we use the provided schema to detect
>>>>> BYTES fields and handle the base64 encoding accordingly.
>>>>> - When data is written to an existing table, we use the API to get
>>>>> the schema of the table and handle the base64 encoding accordingly.
>>>>> We also pass the schema as an argument to avoid the error from schema
>>>>> autodetect.
>>>>>
>>>>> Reading data:
>>>>>
>>>>> - When reading data, we also request the schema and handle the base64
>>>>> decoding accordingly, so that raw bytes are returned.
>>>>>
>>>>> What are your thoughts on this?
>>>>>
>>>>
>>>> Thanks for the update. More context here:
>>>> https://issues.apache.org/jira/browse/BEAM-6769
>>>>
>>>> The suggested solution sounds good to me. BTW, do you know how the
>>>> Java SDK handles the bytes type? I believe we write JSON files and
>>>> execute load jobs there as well (when the method is FILE_LOADS).
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>> * I could not find this in the documentation of the API, or in the
>>>>> documentation of BigQuery itself, which also expects base64-encoded
>>>>> values. I discovered this when uploading a file in the BQ UI and
>>>>> getting the error: "Could not decode base64 string to bytes."
>>>>>
>>>>> --
>>>>> Juta Staes
>>>>> ML6 Gent
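[A rough sketch of Juta's suggested schema-driven handling: look up which
fields are BYTES in the table schema (whether user-provided or fetched from
the API), and base64-encode/decode only those. The schema dict below follows
the REST API's {'fields': [{'name': ..., 'type': ...}]} shape; the helper
names are hypothetical, not Beam's actual API.]

```python
import base64

def bytes_fields(schema):
    # Collect the names of all top-level BYTES fields in the schema.
    return {f['name'] for f in schema['fields'] if f['type'] == 'BYTES'}

def encode_row(row, schema):
    # Base64-encode only the BYTES fields before JSON serialization.
    keys = bytes_fields(schema)
    return {k: base64.b64encode(v).decode('ascii') if k in keys else v
            for k, v in row.items()}

def decode_row(row, schema):
    # Base64-decode only the BYTES fields when reading, returning raw bytes.
    keys = bytes_fields(schema)
    return {k: base64.b64decode(v) if k in keys else v
            for k, v in row.items()}

schema = {'fields': [{'name': 'name', 'type': 'STRING'},
                     {'name': 'data', 'type': 'BYTES'}]}
row = {'name': 'example', 'data': b'\x00\xff'}
assert decode_row(encode_row(row, schema), schema) == row
```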