nbali opened a new issue, #25567:
URL: https://github.com/apache/beam/issues/25567

   ### What happened?
   
   When working with `BigQueryIO.Write.withAvroFormatFunction()`, the function's input is an `AvroWriteRequest<T>`, which is essentially `T` paired with an Avro `Schema`. The problem is how this provided schema gets created, and how it is used.
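   
   For reference, a minimal sketch of how this is wired up; `MyPojo` and its `count` field are hypothetical, and the snippet is a fragment rather than a complete pipeline:
   
```java
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// MyPojo is assumed to be a POJO with a public `int count` field.
BigQueryIO.Write<MyPojo> write =
    BigQueryIO.<MyPojo>write()
        .withAvroFormatFunction(
            (AvroWriteRequest<MyPojo> request) -> {
              // request.getSchema() is whatever the schema factory produced
              // from the TableSchema, not necessarily the schema the POJO
              // was originally modeled with.
              GenericRecord record = new GenericData.Record(request.getSchema());
              record.put("count", request.getElement().count);
              return record;
            });
```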
   
   For that a configurable `withAvroSchemaFactory` method exists, but what it accepts is essentially a `SerializableFunction<@Nullable TableSchema, org.apache.avro.Schema>`, and this is the problem: the `TableSchema` might already have lost context. For example, Avro distinguishes INT and LONG, while BQ recognizes no difference between them. So at the format function we might receive a schema that differs from our desired one, and there is no reversible way to transform it back.
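   
   To make the lossiness concrete, here is a minimal sketch using Beam's own conversion utilities (`AvroUtils` and `BigQueryUtils`); the record and field names are made up:
   
```java
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.avro.SchemaBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;

// The POJO field is an int, so the desired Avro schema says "int".
org.apache.avro.Schema desired =
    SchemaBuilder.record("MyPojo").fields().requiredInt("count").endRecord();

// Going through Beam/BQ schemas collapses int into INT64 ("INTEGER")...
Schema beamSchema = AvroUtils.toBeamSchema(desired);
TableSchema tableSchema = BigQueryUtils.toTableSchema(beamSchema);

// ...so any Avro schema derived back from the TableSchema says "long";
// the int/long distinction is gone and cannot be recovered.
org.apache.avro.Schema derived =
    AvroUtils.toAvroSchema(BigQueryUtils.fromTableSchema(tableSchema));
```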
   
   Using our desired schema at the format function doesn't work either, because the `DataFileWriter` in `AvroRowWriter` is already constructed with a `DatumWriter` bound to the factory-provided `Schema`, so the write might fail.
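   
   A plain-Avro sketch of why this fails (schemas made up, and the exact exception may vary by Avro version; assume this runs inside a method that throws `IOException`):
   
```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// What the schema factory produced (via the lossy TableSchema):
org.apache.avro.Schema writerSchema =
    SchemaBuilder.record("MyPojo").fields().requiredLong("count").endRecord();
// What the format function actually builds records against:
org.apache.avro.Schema recordSchema =
    SchemaBuilder.record("MyPojo").fields().requiredInt("count").endRecord();

GenericRecord record = new GenericData.Record(recordSchema);
record.put("count", 42); // an Integer, matching the record's own schema

try (DataFileWriter<GenericRecord> writer =
    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(writerSchema))) {
  writer.create(writerSchema, new ByteArrayOutputStream());
  // Throws (e.g. ClassCastException: Integer cannot be cast to Long),
  // because the DatumWriter serializes by the writer schema's "long".
  writer.append(record);
}
```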
   
   To give you a more exact example: I had a POJO with an `int` field, and already had methods providing me with the Beam `Row` and `Schema`, the Avro `Schema`, and the `GenericRecord` for that POJO, as I need those for other purposes. The `TableSchema` returned by the `DynamicDestinations` obviously contains `long` due to the supported BQ types, so the schema-factory-generated Avro schema also contains `long`. Meanwhile the `Row` and `GenericRecord` provided by my custom code contain `int`, as the POJO did. ... and as you might guess, an exception happens when it tries to write an `int` as a `long`.
   
   My "methods" actually use `toAvroSchema` and `toGenericRecord` from 
`org.apache.beam.sdk.schemas.utils.AvroUtils`, so it's not some custom code, 
but internal Beam code.
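   
   Roughly, for a POJO with a single `int` field, those methods boil down to something like this (a sketch, with a hand-built Beam `Schema` standing in for the one inferred from the POJO):
   
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.Row;

Schema beamSchema = Schema.builder().addInt32Field("count").build();
Row row = Row.withSchema(beamSchema).addValue(42).build();

// Both conversions come from Beam's own AvroUtils, not custom code:
org.apache.avro.Schema avroSchema = AvroUtils.toAvroSchema(beamSchema); // "count" is "int"
GenericRecord record = AvroUtils.toGenericRecord(row, avroSchema);
```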
   
   So to sum things up, IMO the `SchemaFactory` should have the ability to use more context/info than just a `TableSchema`. Given how it's being called at https://github.com/apache/beam/blob/40838f76447a5250b52645210f26dce3655d7009/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/RowWriterFactory.java#L143-L145 using the `destination` might already be helpful.
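   
   For illustration, one hypothetical shape this could take (the names and the `String` destination type are invented, not existing Beam API):
   
```java
import com.google.api.services.bigquery.model.TableSchema;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.beam.sdk.transforms.SerializableBiFunction;

// Hypothetical destination-aware schema factory; String stands in for
// the pipeline's DestinationT.
Map<String, Schema> schemasByDestination = new HashMap<>(); // filled elsewhere
SerializableBiFunction<String, TableSchema, Schema> avroSchemaFactory =
    (destination, tableSchema) ->
        // With the destination in hand, the factory can return the exact
        // schema the elements were built with, instead of re-deriving it
        // from the (lossy) TableSchema.
        schemasByDestination.get(destination);
```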
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

