bastewart commented on issue #25228:
URL: https://github.com/apache/beam/issues/25228#issuecomment-1442119847

   Sorry, I had missed your reply!
   
   The summary of this message is that the conversion is very hard to do in our case. I also think it's reasonable to treat `1.0` as an integer; JSON Schema [does, for example](https://json-schema.org/understanding-json-schema/reference/numeric.html#integer). Finally, I think this might be a regression compared to the Streaming API implementation (at the very least, this combined with #25227 and #25233 means there is a regression).
   
   > Even in the above example, not all INT64 values can be converted to Java 
double. e.g. the following code:
   
   Sorry, I may be misunderstanding, but that's the opposite direction to the issue we have.
   
   Our issue is Java doubles arriving that need to be converted to an `INT64` for BigQuery. In that case - by my understanding ([see my fix commit](https://github.com/bastewart/beam/commit/38601213f81896444c60dd9e590f8a795358d09a)) - it's trivial to optimistically convert the value to a long and fail if precision has been lost.
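   For illustration, here is a minimal sketch of that idea (a hypothetical helper, not the code from the commit above): cast the double to a long and reject the value if the round-trip changes it.
   
   ```java
   public final class DoubleToInt64 {
     // Hypothetical helper (not part of Beam): optimistic double -> long conversion.
     static long toLongExact(double value) {
       long asLong = (long) value;
       // The cast truncates; if casting back does not reproduce the original
       // double, precision or range was lost, so reject the value.
       // 1.0 passes; 1.5, NaN and Infinity all throw.
       if ((double) asLong != value) {
         throw new IllegalArgumentException(
             "Value " + value + " cannot be converted to INT64 without loss");
       }
       return asLong;
     }
   
     public static void main(String[] args) {
       System.out.println(toLongExact(1.0)); // prints 1
       System.out.println(toLongExact(1.5)); // throws IllegalArgumentException
     }
   }
   ```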
   
   > It also should be easy for any user to write a ParDo (or MapElements) that 
does their own numeric conversions before calling BigQueryIO.
   
   Unfortunately, in our case this is very much non-trivial 😅
   
   We write to 1000+ tables which have dynamic schemas*. This means we're 
relying on Beam to load the schema and convert the values for us. I think we'd 
have to re-implement, or use directly, most of `StorageApiDynamicDestinations` 
to grab schema info, and then do our own traversal of all data before handing 
it over to Beam.
   
   *I am aware that automatic schema reloading in the Beam Storage Writes API 
is not yet released...
   
   We also don't have direct or easy control over the types as they arrive. We're consuming JSON blobs off Kafka and loading the data into BigQuery. Those JSON blobs _have_ been validated against the BigQuery schema using [JSON Schema](https://json-schema.org/). JSON Schema [validates numbers with a zero fractional part as integers](https://json-schema.org/understanding-json-schema/reference/numeric.html#integer), so this would be hard for us to guard against.
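   As a concrete example of the mismatch (Jackson here is only for illustration and is an assumption, not necessarily what our pipeline uses), a JSON `1.0` that passes a JSON Schema `integer` check still arrives in Java as a floating-point value:
   
   ```java
   import com.fasterxml.jackson.databind.JsonNode;
   import com.fasterxml.jackson.databind.ObjectMapper;
   
   public class JsonNumberExample {
     public static void main(String[] args) throws Exception {
       // "count" is valid against a JSON Schema {"type": "integer"} because its
       // fractional part is zero, but Jackson parses it as a floating-point node.
       JsonNode node = new ObjectMapper().readTree("{\"count\": 1.0}").get("count");
       System.out.println(node.isIntegralNumber()); // false
       System.out.println(node.numberValue());      // 1.0 (a Double, not a Long)
     }
   }
   ```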
   
   More generally, and personally, I feel like `1.0` should be treated as an 
integer. It's extremely easy for that kind of mis-conversion to occur, and 
being overly strict just leads to unexpected consequences. This issue coupled 
with #25227 and #25233 means that a lot of rows can be dropped "silently".
   
   ### Streaming API Behaviour
   
   I need to check, but I am reasonably sure the Streaming API allows this kind of coercion. I'll double-check and get back to you.
   

