joachim-isaksson-centiro opened a new issue, #25526:
URL: https://github.com/apache/beam/issues/25526

   ### What happened?
   
   I will add a failing test below, but basically we have a structure in our 
system which looks something like this:
   
   class1 { identifier: record1 }
   class2 { identifier: record2, class1: class1 }
   
   That is, we have two separate members with the name "identifier" in two 
different parts of the type we're trying to write to BigQuery.
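
A minimal Kotlin sketch of that shape, for illustration only. The class names follow the pseudo-structure above, and the id1/id2 fields are borrowed from the test further down; none of these are the real types.

```kotlin
// Hypothetical types mirroring the structure described above.
data class Record1(val id1: String)
data class Record2(val id2: String)

// Both classes have a member called "identifier", but of different record types.
data class Class1(val identifier: Record1)
data class Class2(val identifier: Record2, val class1: Class1)
```

The two "identifier" members live at different depths: one directly in Class2, the other inside the nested Class1.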
   
   When BigQueryIO calls BigQueryAvroUtils.toGenericAvroSchema() on the type, 
it generates a schema for the structure, but unfortunately calling toString() 
on the resulting Avro schema crashes with:
   
   Method threw 'org.apache.avro.SchemaParseException' exception.
   
   This seems to have two causes:
   
   * BigQueryAvroUtils.toGenericAvroSchema uses a _static_ namespace of 
"org.apache.beam.sdk.io.gcp.bigquery" for all types, no matter where in the 
type structure they are located. If it instead added the enclosing type to 
the namespace (for example 
org.apache.beam.sdk.io.gcp.bigquery.class1.identifier), there should be no 
problem.
   
   * It seems to handle the member _name_ (identifier) as a type name in the 
schema, so it thinks the two members with the same _name_ are trying to 
redefine a _type_.
   
   Not quite clear on the terminology here so I may be using it wrong, but 
basically it tries to register org.apache.beam.sdk.io.gcp.bigquery.identifier 
twice in org.apache.avro.Schema$Names.put and that crashes the write to BQ.
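
To illustrate the namespace point, here is a rough, standalone Kotlin sketch that does not use Beam or Avro at all. `Field`, `recordFullNames` and `collisions` are hypothetical helpers invented for this example; they only compute which full names nested records would end up with under a static namespace versus a namespace that includes the enclosing record names.

```kotlin
// Hypothetical stand-in for a BigQuery field: a name plus nested fields (empty for scalars).
data class Field(val name: String, val children: List<Field> = emptyList())

const val STATIC_NAMESPACE = "org.apache.beam.sdk.io.gcp.bigquery"

// Collects the full name each nested record would get.
// With pathQualified = false every record shares the static namespace;
// with pathQualified = true the enclosing record names are appended to it.
fun recordFullNames(
    fields: List<Field>,
    pathQualified: Boolean,
    namespace: String = STATIC_NAMESPACE,
): List<String> = fields
    .filter { it.children.isNotEmpty() }
    .flatMap { record ->
        val childNamespace = if (pathQualified) "$namespace.${record.name}" else namespace
        listOf("$namespace.${record.name}") +
            recordFullNames(record.children, pathQualified, childNamespace)
    }

// Full names that occur more than once, i.e. the ones that would collide.
fun collisions(names: List<String>): Set<String> =
    names.groupingBy { it }.eachCount().filterValues { it > 1 }.keys

// Same shape as the failing structure: root { record { identifier }, identifier }.
val sampleFields = listOf(
    Field("record", listOf(Field("identifier", listOf(Field("id1"))))),
    Field("identifier", listOf(Field("id2"))),
)
```

With `pathQualified = false`, `collisions(recordFullNames(sampleFields, false))` contains org.apache.beam.sdk.io.gcp.bigquery.identifier. With `pathQualified = true` the nested record becomes org.apache.beam.sdk.io.gcp.bigquery.record.identifier and the set is empty. Whether Beam should derive namespaces this way is only a suggestion, not something confirmed by the maintainers.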
   
   The structure works without any issues up to Beam 2.42 but fails on 
2.43 and 2.44.
   
   To maybe make it clearer, here's a very basic unit test (in Kotlin, but 
it should translate over to Java fairly easily, I hope) that fails on the 
toString() call; it builds the TableSchema manually, but in the same structure 
as it seems to be built by BigQueryIO for our type.
   
   ```kotlin
   package org.apache.beam.sdk.io.gcp.bigquery
   
   import com.google.api.services.bigquery.model.TableFieldSchema
   import org.junit.jupiter.api.Test
   
   class SchemaTest {
   
       @Test
       fun test() {
           // Two leaf fields with different names, so the two "identifier" records differ.
           val stringSchema1 = TableFieldSchema().setName("id1").setType("STRING")
           val stringSchema2 = TableFieldSchema().setName("id2").setType("STRING")
   
           // Two RECORD fields that share the name "identifier" but hold different fields.
           val identifier1Schema = TableFieldSchema().setName("identifier").setType("RECORD")
               .setFields(listOf(stringSchema1))
           val identifier2Schema = TableFieldSchema().setName("identifier").setType("RECORD")
               .setFields(listOf(stringSchema2))
   
           // The first "identifier" is nested one level deeper, inside "record".
           val recordSchema = TableFieldSchema().setName("record").setType("RECORD")
               .setFields(listOf(identifier1Schema))
   
           // The second "identifier" sits next to "record", directly under the root.
           val rootSchema = TableFieldSchema().setName("root").setType("RECORD")
               .setFields(listOf(recordSchema, identifier2Schema))
   
           val output = BigQueryAvroUtils.toGenericAvroSchema("root", rootSchema.fields)
   
           // This call throws org.apache.avro.SchemaParseException on Beam 2.43 and 2.44.
           output.toString()
       }
   }
   ```
   
   The test fails as is, but renaming the member id2 to id1 so that both 
instances of the member with the name "identifier" are seen as the same type 
makes the test pass.
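
One possible explanation for that, sketched as a standalone Kotlin model rather than real Avro code: `RecordDef`, `NameRegistry` and `serializes` are hypothetical names, and the rule they implement (an identical repeat is treated as a reference, a differing repeat is rejected) is an assumption about how the name registry behaves during serialization, not something verified against the Avro sources.

```kotlin
// Hypothetical stand-in for a named record schema: a full name plus its field names.
data class RecordDef(val fullName: String, val fieldNames: List<String>)

// Simplified model of a name registry used while serializing a schema.
// Assumption: an identical definition is emitted as a reference to the first one,
// while a different definition under an already known name is rejected.
class NameRegistry {
    private val known = mutableMapOf<String, RecordDef>()

    // Returns "definition" on first sight and "reference" for an identical repeat.
    fun register(def: RecordDef): String {
        val existing = known[def.fullName]
        return when {
            existing == null -> { known[def.fullName] = def; "definition" }
            existing == def -> "reference"
            else -> throw IllegalStateException("Cannot redefine: ${def.fullName}")
        }
    }
}

// Registers all definitions in order; true if every registration is accepted.
fun serializes(defs: List<RecordDef>): Boolean {
    val registry = NameRegistry()
    return runCatching { defs.forEach { registry.register(it) } }.isSuccess
}
```

Under this model, two "identifier" records with fields id1 and id2 are rejected, while two records that both contain id1 are accepted because the second is an exact repeat of the first. That matches what the test shows, but it would also mean the rename only hides the naming problem rather than fixing it.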
   
   If it helps, I'll try to make a more complete example that builds the 
TableSchema from the type in the same way BigQueryIO does, but I hope this 
makes the problem clear.
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [X] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

