rdblue commented on code in PR #4242:
URL: https://github.com/apache/iceberg/pull/4242#discussion_r876089282
##########
core/src/main/java/org/apache/iceberg/avro/AvroSchemaWithTypeVisitor.java:
##########
@@ -79,11 +79,29 @@ private static <T> T visitRecord(Types.StructType struct,
Schema record, AvroSch
private static <T> T visitUnion(Type type, Schema union,
AvroSchemaWithTypeVisitor<T> visitor) {
List<Schema> types = union.getTypes();
List<T> options = Lists.newArrayListWithExpectedSize(types.size());
- for (Schema branch : types) {
- if (branch.getType() == Schema.Type.NULL) {
- options.add(visit((Type) null, branch, visitor));
- } else {
- options.add(visit(type, branch, visitor));
+
+ // simple union case
+ if (AvroSchemaUtil.isOptionSchema(union)) {
+ for (Schema branch : types) {
+ if (branch.getType() == Schema.Type.NULL) {
+ options.add(visit((Type) null, branch, visitor));
+ } else {
+ options.add(visit(type, branch, visitor));
+ }
+ }
+ } else { // complex union case
+ Preconditions.checkArgument(type instanceof Types.StructType,
+ "Cannot visit invalid Iceberg type: %s for Avro complex union type:
%s", type, union);
Review Comment:
> How about aligning by the type?
I think that's what we will need to do at some point, but this visitor
assumes that both schemas have field IDs. I think for this, the right way to
handle it is to get the field ID from the union type. It would mean rewriting
the Avro schema ahead of time to look like this:
```json
[
"null",
{"type": "int", "field-id": 34},
{"type": "string", "field-id": 35}
]
```
That's why I'm wondering about how to attach the field IDs in the name
mapping. In the name mapping, we could allow a nested level to represent the
union. Names in that level could be types rather than names, so the mapping to
produce the union above would be `[ { "field-id": 34, "names": ["int"] }, {
"field-id": 35, "names": ["string"] } ]`. That works for simple types. For
record, map, and array types we can use the simple type name as well,
`"record"` or `"map"` or `"array"`. That would support any union with just one
option of each nested type. If you had more than one map in the union, it would
fail. I think that's a reasonable starting place, though.
@wmoustafa, what do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]