wmoustafa commented on code in PR #7392:
URL: https://github.com/apache/iceberg/pull/7392#discussion_r1201676629
##########
core/src/main/java/org/apache/iceberg/avro/SchemaToType.java:
##########
@@ -104,13 +104,23 @@ public Type record(Schema record, List<String> names, List<Type> fieldTypes) {
   @Override
   public Type union(Schema union, List<Type> options) {
-    Preconditions.checkArgument(
-        AvroSchemaUtil.isOptionSchema(union), "Unsupported type: non-option union: %s", union);
-    // records, arrays, and maps will check nullability later
-    if (options.get(0) == null) {
-      return options.get(1);
+    if (AvroSchemaUtil.isOptionSchema(union)) {
+      if (options.get(0) == null) {
+        return options.get(1);
+      } else {
+        return options.get(0);
+      }
     } else {
-      return options.get(0);
+      // Create list of Iceberg schema fields
+      List<Types.NestedField> fields = Lists.newArrayListWithExpectedSize(options.size());
+      int tagIndex = 0;
+      fields.add(Types.NestedField.required(allocateId(), "tag", Types.IntegerType.get()));
+      for (Type option : options) {
+        if (option != null) {
Review Comment:
The union's null branch just denotes that the complex union is optional. The corresponding field would always be null, and when the union itself is null, the whole corresponding struct is null (so we never end up with a non-null struct whose fields are all null). That is typically also how other engines approach union-to-struct conversion. Here are some examples in [Hive](https://github.com/apache/hive/blob/4f1dba31b25fb2c3b071492ff06b54f76a55c99a/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFExtractUnion.java#L193) and [Trino](https://github.com/trinodb/trino/blob/1621cb6d92b3e96b8204ba71ceca06e8090f0a58/plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveTypeTranslator.java#L217) (they convert from the Hive schema, but the result is always nullable). Spark's `SchemaConverters` also [explicitly drops the null branch before mapping to a struct](https://github.com/apache/spark/blob/a5c53384def22b01b8ef28bee6f2d10648bce1a1/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L129).
We would want to be compatible with these other conventions because:
* It actually makes sense to view the union type as optional + the main types, so the tag indexes map directly to the main types.
* It continues the convention adopted elsewhere rather than creating multiple competing conventions/standards.
* It avoids data conversion adapters (between two schema conventions) during migrations from other formats to Iceberg, especially in the Trino case, where the schemas would match exactly: `(tag, field_0, field_1, ...)` with no dedicated null field (see the sketch below). Trino set the precedent of converting unions to structs at the table level (unlike Hive, which uses an explicit UDF, and Spark, which does it at the file level). Beyond avoiding migration work, we could establish the schema field names/convention adopted by both Iceberg and Trino as the standard for other engines to follow.
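To make the intended shape concrete, here is a minimal sketch (not code from this PR; the `field_0`/`field_1` names and the field IDs are assumptions following the `(tag, field_0, field_1, ...)` naming above) of how `union(null, int, string)` would map under this convention:

```java
import org.apache.avro.Schema;
import org.apache.iceberg.types.Types;

public class UnionToStructSketch {
  public static void main(String[] args) {
    // Avro complex union: the null branch only marks the union as optional.
    Schema union = Schema.createUnion(
        Schema.create(Schema.Type.NULL),
        Schema.create(Schema.Type.INT),
        Schema.create(Schema.Type.STRING));

    // Expected Iceberg struct under the (tag, field_0, field_1, ...)
    // convention: no dedicated null field; tag 0 selects field_0 (int),
    // tag 1 selects field_1 (string), and the struct itself is optional
    // wherever the union appears. Field IDs here are illustrative only.
    Types.StructType struct = Types.StructType.of(
        Types.NestedField.required(1, "tag", Types.IntegerType.get()),
        Types.NestedField.optional(2, "field_0", Types.IntegerType.get()),
        Types.NestedField.optional(3, "field_1", Types.StringType.get()));

    System.out.println(union + " -> " + struct);
  }
}
```

The key point is that the null branch contributes no struct field; optionality lives on the struct itself, which keeps the tag indexes aligned with the main types.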