haizhou-zhao opened a new issue #4072:
URL: https://github.com/apache/iceberg/issues/4072
Hello,
We are using Iceberg 0.11.x, and we ran into an issue when assembling an Iceberg
schema from the result of `AvroSchemaUtil.buildAvroProjection`.
First, a few words on the overall architecture of our operation: we manage
Iceberg tables for our customers and accept schema evolution requests from them
as Avro schemas. This is our schema compatibility check:
```
Given: [avro.Schema] currentSchema, [avro.Schema] updatedSchema
step 1: currentIcebergSchema = AvroSchemaUtil.toIceberg(currentSchema)
step 2: projectedIds = TypeUtil.getProjectedIds(currentIcebergSchema)
step 3: nameMapping = MappingUtil.create(currentIcebergSchema)
step 4: prunedSchema = AvroSchemaUtil.pruneColumns(updatedSchema, projectedIds, nameMapping)
step 5: projectionSchema = AvroSchemaUtil.buildAvroProjection(prunedSchema, currentIcebergSchema, HashMap.empty())
step 6: updatedIcebergSchema = AvroSchemaUtil.toIceberg(projectionSchema)
step 7: CheckCompatibility.writeCompatibilityErrors(currentIcebergSchema, updatedIcebergSchema)
```
Recently, one of our customers submitted a schema change that adds more
optional fields to the element type of an array column.
```
<excerpt of schema change>
{
  "type": "record",
  "name": "top-level",
  "fields": [
    {
      "type": "array",
      "items": {
        "type": "record",
        "name": "element-record",
        "fields": [
          {
            "name": "field1",
            "type": ["null", "int"]
          },
+         {
+           "name": "field2",
+           "type": ["null", "string"]
+         }
        ]
      }
    }
  ]
}
```
Our compatibility check then breaks at step 6
(`java.lang.IllegalArgumentException: Multiple entries with same key:`). What
we've discovered so far:
1. Each Avro node in `prunedSchema` has an id assigned after step 4.
2. The "top-level.element" node loses the id it had in `prunedSchema` when
`projectionSchema` is built in step 5
[reference](https://github.com/apache/iceberg/blob/0.11.x/core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java#L203)
3. When an Iceberg schema is reassembled from `projectionSchema` at step 6,
the "top-level.element" node has no "element-id" property, so it is assigned a
generated id that happens to collide with the id of another node
[reference](https://github.com/apache/iceberg/blob/0.11.x/core/src/main/java/org/apache/iceberg/avro/SchemaToType.java#L75)
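To make the failure mode in finding 3 concrete, here is a minimal sketch that uses only the Java standard library. It is a simplified model, not Iceberg's actual code: the class `IdCollisionSketch`, its methods, the node names, the ids, and the counter start value are all hypothetical. It only shows how a freshly generated id for a node that lost its id can collide with an id that was preserved elsewhere, and how a duplicate check on the resulting id index then fails.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdCollisionSketch {

  // Fill in ids for nodes that lack one, using a plain counter.
  // Hypothetical stand-in; this is not Iceberg's real id allocation logic.
  static Map<String, Integer> assignMissingIds(Map<String, Integer> nodes, int firstFreshId) {
    Map<String, Integer> result = new LinkedHashMap<>();
    int next = firstFreshId;
    for (Map.Entry<String, Integer> entry : nodes.entrySet()) {
      if (entry.getValue() != null) {
        result.put(entry.getKey(), entry.getValue()); // id was preserved
      } else {
        result.put(entry.getKey(), next++); // id was lost, so generate one
      }
    }
    return result;
  }

  // Build an id -> name index and reject duplicate ids.
  static Map<Integer, String> indexById(Map<String, Integer> nodes) {
    Map<Integer, String> index = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : nodes.entrySet()) {
      String previous = index.put(entry.getValue(), entry.getKey());
      if (previous != null) {
        throw new IllegalArgumentException(
            "Multiple entries with same key: " + entry.getValue()
                + " (" + previous + " and " + entry.getKey() + ")");
      }
    }
    return index;
  }

  public static void main(String[] args) {
    // Hypothetical projected schema: the element node has lost its id,
    // while the fields below it kept theirs.
    Map<String, Integer> projected = new LinkedHashMap<>();
    projected.put("top-level.element", null);
    projected.put("top-level.element.field1", 1);
    projected.put("top-level.element.field2", 2);

    Map<String, Integer> withIds = assignMissingIds(projected, 1);
    try {
      indexById(withIds);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In this model the generated id for the element node equals the preserved id of `field1`, so building the index throws. If the element node had kept its original id, no fresh id would be generated and the index would build cleanly.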
Our questions are:
1. When projecting a schema of array type, should the "element-id" property
be preserved even if the underlying schema differs?
[Reference](https://github.com/apache/iceberg/blob/0.11.x/core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java#L203)
2. If not, is there an Iceberg API/Util that performs projection
(column reordering, etc.) while preserving the ids of all nodes from the
projected schema?