wmoustafa opened a new pull request, #7392: URL: https://github.com/apache/iceberg/pull/7392
This PR adds a name mapping mechanism that can use mapping keys from outside the Iceberg schema, e.g., from a source Avro schema. Its application is mapping the branches of complex union types to their respective field IDs in the Iceberg schema. More context is in #5704.

The logical sequence for reading union types is as follows. The union type data was originally written with an input Avro schema, and that schema is used to generate the Iceberg table schema. A complex union type in the Avro schema is converted to an Iceberg struct with fields `(int tag, Type0 field0, Type1 field1, ..)`, where each `Type_i` corresponds to a union branch from the input Avro schema, in the order the branches appear in the Avro schema. At read time, the Iceberg schema is used to read the Avro files with union types, but there is no guarantee that the union branches in a file appear in the same order as the fields of the Iceberg struct. Name mapping is therefore required to connect the union branches in the file to the respective field IDs in the Iceberg struct.

The current name mapping mechanism builds the entire name mapping from the Iceberg schema as the sole source of truth. This does not extend to mapping Avro union types to Iceberg structs because:

* Avro union type options do not have a field name to begin with.
* Other sources of identification (e.g., the union branch data type, or the branch record type in the case of Avro named types) do not make it to the Iceberg schema.

For these reasons, a supporting Avro schema is required to derive the name mapping, unlike the field-name case, where the Iceberg schema alone is adequate.
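As a rough illustration of the union-to-struct layout described above, the following sketch builds the `(tag, field0, field1, ..)` field list from an ordered list of union branch types. It uses plain strings instead of real Avro or Iceberg types, and the class and method names (`UnionToStructSketch`, `toStructFields`) are hypothetical and not part of this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: this is not Iceberg's or Avro's API.
final class UnionToStructSketch {

  // Returns "name: type" descriptors for the struct derived from a union:
  // an int tag first, then one field per union branch, in source order.
  static List<String> toStructFields(List<String> unionBranchTypes) {
    List<String> fields = new ArrayList<>();
    fields.add("tag: int");
    for (int i = 0; i < unionBranchTypes.size(); i++) {
      fields.add("field" + i + ": " + unionBranchTypes.get(i));
    }
    return fields;
  }

  public static void main(String[] args) {
    // Branch order here stands in for the order in the input Avro schema.
    List<String> branches = List.of("int", "string", "User");
    for (String field : toStructFields(branches)) {
      System.out.println(field);
    }
  }
}
```

Because `field0`, `field1`, and so on are purely positional, nothing in the derived struct records which union branch each field came from. That is the gap the name mapping is meant to close.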
To map union types using both the source Avro schema and the derived Iceberg schema, the two schemas are traversed simultaneously. The ID is extracted from the Iceberg schema, and the other identifying information is extracted from the Avro schema:

* If the union branch is a named type (`RECORD`, `ENUM`, or `FIXED`), the name of the type is used as the identifier.
* Otherwise, the type's `toString()` value is used.

This is adequate identifying information according to the Avro spec, since an Avro union cannot contain more than one map type, more than one array type, or two branches of the same primitive type. Further, named type names are unique within a union. In non-union cases (e.g., regular nested fields), the standard field name is used as the name mapping key.

Since the Avro schema in this scenario is the main schema from which the Iceberg schema is derived, `AvroSchemaWithTypeVisitor` cannot be reused as the abstract visitor class, because it assumes the Avro schema is already annotated with the IDs. Hence, this PR introduces a new abstract visitor class, `AvroSchemaWithDerivedTypeVisitor`, which assumes that the Avro schema is the source schema and the Iceberg schema is the derived schema. The name mapping logic is defined in `NameMappingWithAvroSchema`. The union-specific implementation that reflects the identification method above is in `NameMappingWithAvroSchema#union`.

Testing: Added unit test.
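The identification rule for union branches can be sketched roughly as follows. This is a minimal, self-contained model and not the code in `NameMappingWithAvroSchema#union`: the `Branch` class, the `key` and `buildMapping` methods, and the field IDs are all hypothetical, and the `typeString` field merely stands in for whatever the Avro type's `toString()` returns.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model: this is not Iceberg's or Avro's API.
final class UnionBranchKeySketch {

  // Avro's named types; all other types are identified by their string form.
  private static final Set<String> NAMED_KINDS = Set.of("record", "enum", "fixed");

  static final class Branch {
    final String kind;       // e.g. "int", "array", "record"
    final String name;       // type name for named types, otherwise null
    final String typeString; // stand-in for the Avro type's toString()

    Branch(String kind, String name, String typeString) {
      this.kind = kind;
      this.name = name;
      this.typeString = typeString;
    }
  }

  // Named types are keyed by name, everything else by string form.
  static String key(Branch branch) {
    return NAMED_KINDS.contains(branch.kind) ? branch.name : branch.typeString;
  }

  // Walks the source union branches and the derived struct's field IDs
  // in lockstep, pairing each branch key with the ID at the same position.
  static Map<String, Integer> buildMapping(List<Branch> sourceBranches, List<Integer> fieldIds) {
    if (sourceBranches.size() != fieldIds.size()) {
      throw new IllegalArgumentException("Branch count and field ID count differ");
    }
    Map<String, Integer> mapping = new LinkedHashMap<>();
    for (int i = 0; i < sourceBranches.size(); i++) {
      String key = key(sourceBranches.get(i));
      if (mapping.put(key, fieldIds.get(i)) != null) {
        throw new IllegalArgumentException("Duplicate union branch key: " + key);
      }
    }
    return mapping;
  }

  public static void main(String[] args) {
    Branch intBranch = new Branch("int", null, "\"int\"");
    Branch stringBranch = new Branch("string", null, "\"string\"");
    Branch userBranch = new Branch("record", "User", "{\"type\":\"record\",\"name\":\"User\"}");

    // Source schema order, with made-up IDs for field0..field2.
    Map<String, Integer> mapping =
        buildMapping(List.of(intBranch, stringBranch, userBranch), List.of(3, 4, 5));

    // A data file whose union lists the branches in a different order
    // still resolves each branch to the same field ID.
    for (Branch fileBranch : List.of(userBranch, intBranch, stringBranch)) {
      System.out.println(key(fileBranch) + " -> " + mapping.get(key(fileBranch)));
    }
  }
}
```

The duplicate-key check reflects the uniqueness argument above: if the Avro spec's union constraints hold, no two branches of the same union should produce the same key.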
