bastewart commented on issue #25228:
URL: https://github.com/apache/beam/issues/25228#issuecomment-1442119847

   Sorry, I had missed your reply!
   
   The summary of this message is that the conversion is very hard to do in our case. I also think it's reasonable to treat `1.0` as an integer; JSON Schema [does, for example](https://json-schema.org/understanding-json-schema/reference/numeric.html#integer). Finally, I think this might be a regression compared to the Streaming API implementation (at the very least, this combined with #25227 and #25233 means there is a regression).
   
   > Even in the above example, not all INT64 values can be converted to Java 
double. e.g. the following code:
   
   Sorry, I may be misunderstanding, but that's the opposite direction to the issue we have.
   
   Our issue is Java doubles arriving that need to be converted to an `INT64` for BigQuery. In that case - by my understanding ([see my fix commit](https://github.com/bastewart/beam/commit/38601213f81896444c60dd9e590f8a795358d09a)) - it's trivial to optimistically convert the value to a long and fail if precision has been lost.
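   For illustration, here is a minimal sketch of that idea (a hypothetical helper, not the code from the commit above): cast the double to a long and reject the value if the round-trip changes it.
   
   ```java
   public final class DoubleToInt64 {
     // Hypothetical helper (not part of Beam): optimistic double -> long conversion.
     static long toLongExact(double value) {
       long asLong = (long) value;
       // The cast truncates; if casting back does not reproduce the original
       // double, precision or range was lost, so reject the value.
       // 1.0 passes; 1.5, NaN and Infinity all throw.
       if ((double) asLong != value) {
         throw new IllegalArgumentException(
             "Value " + value + " cannot be converted to INT64 without loss");
       }
       return asLong;
     }
   
     public static void main(String[] args) {
       System.out.println(toLongExact(1.0)); // prints 1
       System.out.println(toLongExact(1.5)); // throws IllegalArgumentException
     }
   }
   ```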
   
   > It also should be easy for any user to write a ParDo (or MapElements) that 
does their own numeric conversions before calling BigQueryIO.
   
   Unfortunately, in our case this is very much non-trivial 😅
   
   We write to 1000+ tables which have dynamic schemas*. This means we're 
relying on Beam to load the schema and convert the values for us. I think we'd 
have to re-implement, or use directly, most of `StorageApiDynamicDestinations` 
to grab schema info, and then do our own traversal of all data before handing 
it over to Beam.
   
   *I am aware that automatic schema reloading in the Beam Storage Writes API 
is not yet released...
   
   We also don't have direct or easy control over the types as they arrive. We're consuming JSON blobs off Kafka and loading the data into BigQuery. Those JSON blobs _have_ been validated against the BigQuery schema using [JSON Schema](https://json-schema.org/). JSON Schema [validates numbers with a zero fractional part as integers](https://json-schema.org/understanding-json-schema/reference/numeric.html#integer), so this would be hard for us to guard against.
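   As a concrete example of the mismatch (Jackson here is only for illustration and is an assumption, not necessarily what our pipeline uses), a JSON `1.0` that passes a JSON Schema `integer` check still arrives in Java as a floating-point value:
   
   ```java
   import com.fasterxml.jackson.databind.JsonNode;
   import com.fasterxml.jackson.databind.ObjectMapper;
   
   public class JsonNumberExample {
     public static void main(String[] args) throws Exception {
       // "count" is valid against a JSON Schema {"type": "integer"} because its
       // fractional part is zero, but Jackson parses it as a floating-point node.
       JsonNode node = new ObjectMapper().readTree("{\"count\": 1.0}").get("count");
       System.out.println(node.isIntegralNumber()); // false
       System.out.println(node.numberValue());      // 1.0 (a Double, not a Long)
     }
   }
   ```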
   
   More generally, and personally, I feel like `1.0` should be treated as an 
integer. It's extremely easy for that kind of mis-conversion to occur, and 
being overly strict just leads to unexpected consequences. This issue coupled 
with #25227 and #25233 means that a lot of rows can be dropped "silently".
   
   ### Streaming API Behaviour
   
   I need to check, but I am reasonably sure the Streaming API allows this kind of coercion. I'll double-check and get back to you.
   

