nbali opened a new issue, #25567:
URL: https://github.com/apache/beam/issues/25567

   ### What happened?
   
   When working with `BigQueryIO.Write.withAvroFormatFunction()`, the function's input is an `AvroWriteRequest<T>`, which is essentially `T` paired with an Avro `Schema`. The problem is how this provided schema gets created, and how it is used.
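   
   For reference, a minimal sketch of how this is wired up; `MyPojo` and its `count` field are hypothetical, and the snippet is a fragment rather than a complete pipeline:
   
```java
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// MyPojo is assumed to be a POJO with a public `int count` field.
BigQueryIO.Write<MyPojo> write =
    BigQueryIO.<MyPojo>write()
        .withAvroFormatFunction(
            (AvroWriteRequest<MyPojo> request) -> {
              // request.getSchema() is whatever the schema factory produced
              // from the TableSchema, not necessarily the schema the POJO
              // was originally modeled with.
              GenericRecord record = new GenericData.Record(request.getSchema());
              record.put("count", request.getElement().count);
              return record;
            });
```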
   
   For that a configurable `withAvroSchemaFactory` method exists, but what it accepts is essentially a `SerializableFunction<@Nullable TableSchema, org.apache.avro.Schema>`, and this is the problem: the `TableSchema` might already have lost context. For example, Avro distinguishes INT and LONG, while BQ recognizes no difference between them. So at the format function we might receive a schema that differs from our desired one, and there is no reversible way to transform it back.
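   
   To make the lossiness concrete, here is a minimal sketch using Beam's own conversion utilities (`AvroUtils` and `BigQueryUtils`); the record and field names are made up:
   
```java
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.avro.SchemaBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;

// The POJO field is an int, so the desired Avro schema says "int".
org.apache.avro.Schema desired =
    SchemaBuilder.record("MyPojo").fields().requiredInt("count").endRecord();

// Going through Beam/BQ schemas collapses int into INT64 ("INTEGER")...
Schema beamSchema = AvroUtils.toBeamSchema(desired);
TableSchema tableSchema = BigQueryUtils.toTableSchema(beamSchema);

// ...so any Avro schema derived back from the TableSchema says "long";
// the int/long distinction is gone and cannot be recovered.
org.apache.avro.Schema derived =
    AvroUtils.toAvroSchema(BigQueryUtils.fromTableSchema(tableSchema));
```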
   
   Using our desired schema at the format function doesn't work either, because the `DataFileWriter` in `AvroRowWriter` is already constructed with a `DatumWriter` bound to the factory-provided `Schema`, so the write might fail.
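   
   A plain-Avro sketch of why this fails (schemas made up, and the exact exception may vary by Avro version; assume this runs inside a method that throws `IOException`):
   
```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// What the schema factory produced (via the lossy TableSchema):
org.apache.avro.Schema writerSchema =
    SchemaBuilder.record("MyPojo").fields().requiredLong("count").endRecord();
// What the format function actually builds records against:
org.apache.avro.Schema recordSchema =
    SchemaBuilder.record("MyPojo").fields().requiredInt("count").endRecord();

GenericRecord record = new GenericData.Record(recordSchema);
record.put("count", 42); // an Integer, matching the record's own schema

try (DataFileWriter<GenericRecord> writer =
    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(writerSchema))) {
  writer.create(writerSchema, new ByteArrayOutputStream());
  // Throws (e.g. ClassCastException: Integer cannot be cast to Long),
  // because the DatumWriter serializes by the writer schema's "long".
  writer.append(record);
}
```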
   
   To give you a more exact example: I had a POJO with an `int` field, and already had methods providing me with the Beam `Row` and `Schema`, the Avro `Schema`, and the `GenericRecord` for that POJO, as I need those for other purposes. The `TableSchema` returned by the `DynamicDestinations` obviously contains `long` due to the supported BQ types, so the schema-factory-generated Avro schema also contains `long`. Meanwhile the `Row` and `GenericRecord` provided by my custom code contain `int`, as the POJO did. ... and as you might guess, an exception happens when it tries to write an `int` as a `long`.
   
   My "methods" actually use `toAvroSchema` and `toGenericRecord` from 
`org.apache.beam.sdk.schemas.utils.AvroUtils`, so it's not some custom code, 
but internal Beam code.
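   
   Roughly, for a POJO with a single `int` field, those methods boil down to something like this (a sketch, with a hand-built Beam `Schema` standing in for the one inferred from the POJO):
   
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.Row;

Schema beamSchema = Schema.builder().addInt32Field("count").build();
Row row = Row.withSchema(beamSchema).addValue(42).build();

// Both conversions come from Beam's own AvroUtils, not custom code:
org.apache.avro.Schema avroSchema = AvroUtils.toAvroSchema(beamSchema); // "count" is "int"
GenericRecord record = AvroUtils.toGenericRecord(row, avroSchema);
```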
   
   So to sum things up, IMO the `SchemaFactory` should have the ability to use more context/info than just a `TableSchema`. Given how it's being called at https://github.com/apache/beam/blob/40838f76447a5250b52645210f26dce3655d7009/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/RowWriterFactory.java#L143-L145 using the `destination` might already be helpful.
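   
   For illustration, one hypothetical shape this could take (the names and the `String` destination type are invented, not existing Beam API):
   
```java
import com.google.api.services.bigquery.model.TableSchema;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.beam.sdk.transforms.SerializableBiFunction;

// Hypothetical destination-aware schema factory; String stands in for
// the pipeline's DestinationT.
Map<String, Schema> schemasByDestination = new HashMap<>(); // filled elsewhere
SerializableBiFunction<String, TableSchema, Schema> avroSchemaFactory =
    (destination, tableSchema) ->
        // With the destination in hand, the factory can return the exact
        // schema the elements were built with, instead of re-deriving it
        // from the (lossy) TableSchema.
        schemasByDestination.get(destination);
```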
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

