wmoustafa commented on code in PR #7392:
URL: https://github.com/apache/iceberg/pull/7392#discussion_r1201676629
##########
core/src/main/java/org/apache/iceberg/avro/SchemaToType.java:
##########
@@ -104,13 +104,23 @@ public Type record(Schema record, List<String> names, List<Type> fieldTypes) {
   @Override
   public Type union(Schema union, List<Type> options) {
-    Preconditions.checkArgument(
-        AvroSchemaUtil.isOptionSchema(union), "Unsupported type: non-option union: %s", union);
-    // records, arrays, and maps will check nullability later
-    if (options.get(0) == null) {
-      return options.get(1);
+    if (AvroSchemaUtil.isOptionSchema(union)) {
+      if (options.get(0) == null) {
+        return options.get(1);
+      } else {
+        return options.get(0);
+      }
     } else {
-      return options.get(0);
+      // Create list of Iceberg schema fields
+      List<Types.NestedField> fields = Lists.newArrayListWithExpectedSize(options.size());
+      int tagIndex = 0;
+      fields.add(Types.NestedField.required(allocateId(), "tag", Types.IntegerType.get()));
+      for (Type option : options) {
+        if (option != null) {
Review Comment:
The union's null branch just denotes that the complex union is optional. The corresponding field would always be null, and when the union itself is null, the whole corresponding struct is null (so we never end up with a non-null struct whose fields are all null). That is typically also how other engines approach union-to-struct conversion. Here are some examples in [Hive](https://github.com/apache/hive/blob/4f1dba31b25fb2c3b071492ff06b54f76a55c99a/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFExtractUnion.java#L193) and [Trino](https://github.com/trinodb/trino/blob/1621cb6d92b3e96b8204ba71ceca06e8090f0a58/plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveTypeTranslator.java#L217) (they convert from the Hive schema, but the result is always nullable). Spark's `SchemaConverters` also [explicitly drops the null branch before mapping to a struct](https://github.com/apache/spark/blob/a5c53384def22b01b8ef28bee6f2d10648bce1a1/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L129).
We would want to be compatible with these other conventions because:
* It actually makes sense to view the union type as optional + the main types, so the tag indexes map directly to the main types.
* It continues the convention adopted elsewhere rather than creating multiple competing conventions/standards.
* It avoids data conversion adapters (between two schema conventions) during migrations from other formats to Iceberg, especially in the Trino case, where the schemas would match exactly: `(tag, field_0, field_1, ...)` with no dedicated null field (see the sketch below). Trino set the precedent of converting unions to structs at the table level (unlike Hive, which uses an explicit UDF, and Spark, which does it at the file level). Beyond avoiding migration work, we could establish the schema field names/convention adopted by both Iceberg and Trino as the standard for other engines to follow.
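To make the intended shape concrete, here is a minimal sketch (not code from this PR; the `field_0`/`field_1` names and the field IDs are assumptions following the `(tag, field_0, field_1, ...)` naming above) of how `union(null, int, string)` would map under this convention:

```java
import org.apache.avro.Schema;
import org.apache.iceberg.types.Types;

public class UnionToStructSketch {
  public static void main(String[] args) {
    // Avro complex union: the null branch only marks the union as optional.
    Schema union = Schema.createUnion(
        Schema.create(Schema.Type.NULL),
        Schema.create(Schema.Type.INT),
        Schema.create(Schema.Type.STRING));

    // Expected Iceberg struct under the (tag, field_0, field_1, ...)
    // convention: no dedicated null field; tag 0 selects field_0 (int),
    // tag 1 selects field_1 (string), and the struct itself is optional
    // wherever the union appears. Field IDs here are illustrative only.
    Types.StructType struct = Types.StructType.of(
        Types.NestedField.required(1, "tag", Types.IntegerType.get()),
        Types.NestedField.optional(2, "field_0", Types.IntegerType.get()),
        Types.NestedField.optional(3, "field_1", Types.StringType.get()));

    System.out.println(union + " -> " + struct);
  }
}
```

The key point is that the null branch contributes no struct field; optionality lives on the struct itself, which keeps the tag indexes aligned with the main types.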