joachim-isaksson-centiro opened a new issue, #25526:
URL: https://github.com/apache/beam/issues/25526
### What happened?
I will add a failing test below, but basically we have a structure in our
system which looks something like this:
class1 { identifier: record1 }
class2 { identifier: record2, class1: class1 }
That is, we have two separate members with the name "identifier" in two
different parts of the type we're trying to write to BigQuery.
When BigQueryIO calls BigQueryAvroUtils.toGenericAvroSchema() on the type,
it generates a schema for the structure, but unfortunately calling toString()
on the resulting Avro schema crashes with:
Method threw 'org.apache.avro.SchemaParseException' exception.
This seems to happen because:
* BigQueryAvroUtils.toGenericAvroSchema uses a _static_ namespace of
"org.apache.beam.sdk.io.gcp.bigquery" for all types, no matter where in the
type structure they are located. If, for example, it added the
enclosing type to the namespace in this case
(org.apache.beam.sdk.io.gcp.bigquery.class1.identifier), there should be no
problem.
* It seems to handle the member _name_ (identifier) as a type name in the
schema, so it thinks the two members with the same _name_ are trying to
redefine a _type_.
Not quite clear on the terminology here so I may be using it wrong, but
basically it tries to register org.apache.beam.sdk.io.gcp.bigquery.identifier
twice in org.apache.avro.Schema$Names.put and that crashes the write to BQ.
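To illustrate the suspected mechanism, here is a minimal stand-alone sketch in plain Kotlin. It is only a simplified model and not Beam's or Avro's actual code: `Field`, `NameRegistry`, `registerRecords` and `collides` are hypothetical names made up for this example. The assumption it encodes is that a full name (namespace plus field name) may only be bound to one definition, and that re-registering an identical definition is harmless.

```kotlin
// Simplified stand-in for a BigQuery field: a name plus nested fields
// (an empty list means a leaf field such as a STRING).
data class Field(val name: String, val children: List<Field> = emptyList())

// Hypothetical registry modelling the "one definition per full name" rule.
class NameRegistry {
    private val seen = mutableMapOf<String, Field>()

    fun register(fullName: String, definition: Field) {
        val previous = seen.putIfAbsent(fullName, definition)
        // An identical definition under the same name is accepted;
        // a different one is treated as an illegal redefinition.
        if (previous != null && previous != definition) {
            throw IllegalStateException("Can't redefine: $fullName")
        }
    }
}

// Walks the field tree and registers every record-typed field.
// With qualifyByPath = false all records share one static namespace;
// with qualifyByPath = true the enclosing record's name is appended.
fun registerRecords(
    field: Field,
    namespace: String,
    registry: NameRegistry,
    qualifyByPath: Boolean
) {
    if (field.children.isEmpty()) return
    registry.register("$namespace.${field.name}", field)
    val childNamespace = if (qualifyByPath) "$namespace.${field.name}" else namespace
    field.children.forEach { registerRecords(it, childNamespace, registry, qualifyByPath) }
}

// Returns true if registering the whole tree hits a name collision.
fun collides(root: Field, qualifyByPath: Boolean): Boolean =
    try {
        registerRecords(root, "org.apache.beam.sdk.io.gcp.bigquery", NameRegistry(), qualifyByPath)
        false
    } catch (e: IllegalStateException) {
        true
    }

fun main() {
    val identifier1 = Field("identifier", listOf(Field("id1")))
    val identifier2 = Field("identifier", listOf(Field("id2")))
    val root = Field("root", listOf(Field("record", listOf(identifier1)), identifier2))

    println(collides(root, qualifyByPath = false)) // static namespace
    println(collides(root, qualifyByPath = true))  // namespace follows the nesting path
}
```

In this model the static namespace maps both "identifier" records to the same full name, while a path-qualified namespace keeps them apart. Whether a path-qualified namespace is the right fix for BigQueryAvroUtils itself is for the maintainers to judge.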
The structure is working without any issues up to Beam 2.42 but fails on
2.43 and 2.44.
To maybe make it clearer, here's a very basic unit test (in Kotlin, but it
should translate over to Java fairly easily I hope) that fails on the
toString() call; it builds the TableSchema manually, but in the same structure
as it seems to be built by BigQueryIO for our type.
```
package org.apache.beam.sdk.io.gcp.bigquery;

import com.google.api.services.bigquery.model.TableFieldSchema
import org.junit.jupiter.api.Test

class SchemaTest {
    @Test
    fun test() {
        val stringSchema1 =
            TableFieldSchema().setName("id1").setType("STRING")
        val stringSchema2 =
            TableFieldSchema().setName("id2").setType("STRING")
        val identifier1Schema =
            TableFieldSchema().setName("identifier").setType("RECORD")
                .setFields(listOf(stringSchema1))
        val identifier2Schema =
            TableFieldSchema().setName("identifier").setType("RECORD")
                .setFields(listOf(stringSchema2))
        val recordSchema =
            TableFieldSchema().setName("record").setType("RECORD")
                .setFields(listOf(identifier1Schema))
        val rootSchema = TableFieldSchema().setName("root").setType("RECORD")
            .setFields(listOf(recordSchema, identifier2Schema))
        val output = BigQueryAvroUtils.toGenericAvroSchema("root",
            rootSchema.fields)
        val outputAsString = output.toString()
    }
}
```
The test fails as is, but renaming the member id2 to id1 so that both
instances of the member with the name "identifier" are seen as the same type
makes the test pass.
If it helps, I'll try to make a more complete example that builds the
TableSchema from the type in the same way BigQueryIO does, but I hope this
makes the problem clear.
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [X] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner