nbali opened a new issue, #25567: URL: https://github.com/apache/beam/issues/25567
### What happened?

When working with `BigQueryIO.Write.withAvroFormatFunction()`, the format function's input is an `AvroWriteRequest<T>`, which is essentially `T` paired with an Avro `Schema`. The problem is how that provided schema gets created, and how it is then used.

Schema creation is configurable through `withAvroSchemaFactory()`, but what that method accepts is essentially a `SerializableFunction<@Nullable TableSchema, org.apache.avro.Schema>`, and this is the problem: by the time a `TableSchema` exists, type context may already be lost. For example, Avro distinguishes `int` from `long`, while BigQuery has only a single 64-bit integer type. So at the format function we might receive a schema that differs from our desired one, with no reversible way to transform it back. Using our desired schema in the format function does not work either, because the `DataFileWriter` in `AvroRowWriter` is already constructed with a `DatumWriter` bound to the factory-provided `Schema`, so the write can fail. (Code sketches illustrating all of this follow after the component list below.)

To give a more exact example: I had a POJO with an `int` field, and I already had methods that provide the Beam `Row` and `Schema`, the Avro `Schema`, and the `GenericRecord` for that POJO, as I need those for other purposes. The `TableSchema` returned by my `DynamicDestinations` necessarily declares that field as a 64-bit `INTEGER`, since that is the only integer type BigQuery supports, so the schema-factory-generated Avro schema contains `long`. Meanwhile, the `Row` and `GenericRecord` produced by my code contain `int`, just as the POJO does. And as you would guess, an exception is thrown when the writer tries to use the `int` value as a `long`. Note that my "methods" actually use `toAvroSchema` and `toGenericRecord` from `org.apache.beam.sdk.schemas.utils.AvroUtils`, so this is not custom code but Beam's own internals.

To sum up: IMO the schema factory should be able to use more context than just a `TableSchema`. Given how it is called at https://github.com/apache/beam/blob/40838f76447a5250b52645210f26dce3655d7009/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/RowWriterFactory.java#L143-L145, passing in the `destination` might already be enough.

### Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

### Issue Components

- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
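For context, here is roughly how the two hooks relate today. This is a sketch, not my actual pipeline; `MyPojo`, `myDynamicDestinations`, `deriveAvroSchema`, and `toRow` are placeholder names:

```java
// Sketch of the current API surface (MyPojo, myDynamicDestinations,
// deriveAvroSchema, and toRow are placeholders). The schema factory only
// ever sees the TableSchema; the format function then receives whatever
// Avro Schema the factory produced, via AvroWriteRequest.getSchema().
BigQueryIO.Write<MyPojo> write =
    BigQueryIO.<MyPojo>write()
        .to(myDynamicDestinations) // supplies the TableSchema per destination
        .withAvroSchemaFactory(
            tableSchema -> deriveAvroSchema(tableSchema)) // lossy input
        .withAvroFormatFunction(
            request ->
                AvroUtils.toGenericRecord(
                    toRow(request.getElement()), request.getSchema()));
```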
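And here is a minimal, self-contained sketch of the `int`/`long` loss itself. The field name `count` and the record name `root` are invented for illustration, and `factorySchema` stands in for whatever the configured schema factory would produce from the `TableSchema`:

```java
import com.google.api.services.bigquery.model.TableSchema;
import java.io.ByteArrayOutputStream;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.Row;

public class IntLongMismatch {
  public static void main(String[] args) throws Exception {
    // Beam schema for a POJO with a single int field.
    Schema beamSchema = Schema.builder().addInt32Field("count").build();

    // What Beam's own AvroUtils produces: the field is Avro "int".
    org.apache.avro.Schema desired = AvroUtils.toAvroSchema(beamSchema);

    // What a DynamicDestinations has to return: BigQuery has only one
    // integer type, so the field becomes INTEGER (64-bit) here and the
    // "int"-ness is unrecoverable. This is all the schema factory sees.
    TableSchema tableSchema = BigQueryUtils.toTableSchema(beamSchema);
    System.out.println(desired);     // ... "type":"int" ...
    System.out.println(tableSchema); // ... type=INTEGER ...

    // Stand-in for the factory-generated schema: the field became "long".
    org.apache.avro.Schema factorySchema =
        SchemaBuilder.record("root").fields().requiredLong("count").endRecord();

    // The GenericRecord my format function returns, built against the
    // desired schema -- its "count" value is an Integer.
    GenericRecord record =
        AvroUtils.toGenericRecord(Row.withSchema(beamSchema).addValue(42).build(), desired);

    // AvroRowWriter's DataFileWriter is bound to the factory schema, so the
    // write fails -- in my case a ClassCastException along the lines of
    // "Integer cannot be cast to Long" (exact message varies by Avro version).
    Encoder encoder = EncoderFactory.get().binaryEncoder(new ByteArrayOutputStream(), null);
    new GenericDatumWriter<GenericRecord>(factorySchema).write(record, encoder);
  }
}
```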
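Finally, a sketch of the kind of change I mean, written against the linked call site. This is not actual Beam code: the member names only approximate `RowWriterFactory.AvroRowWriterFactory`, and the two-argument factory is hypothetical (though `SerializableBiFunction` itself already exists in Beam):

```java
// Hypothetical destination-aware schema factory (sketch only; today this
// field is a SerializableFunction<@Nullable TableSchema, Schema>).
private final SerializableBiFunction<DestinationT, @Nullable TableSchema, org.apache.avro.Schema>
    schemaFactory;

@Override
AvroRowWriter<AvroT, ElementT> createRowWriter(String tempFilename, DestinationT destination)
    throws Exception {
  TableSchema tableSchema = dynamicDestinations.getSchema(destination);
  // Passing the destination through lets the factory recover context
  // (e.g. int vs long) that the TableSchema has already dropped.
  org.apache.avro.Schema avroSchema = schemaFactory.apply(destination, tableSchema);
  return new AvroRowWriter<>(tempFilename, avroSchema, toAvroRecord, writerFactory);
}
```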
