wmoustafa opened a new pull request, #7392:
URL: https://github.com/apache/iceberg/pull/7392

   This PR adds a new name mapping mechanism that takes its mapping keys from 
outside the Iceberg schema, e.g., from a source Avro schema. Its main use is 
mapping the branches of complex union types to their respective field IDs in 
the Iceberg schema. More context is in #5704.
   
   Reading union types works as follows. The union data was originally written 
with an input Avro schema, and that schema is used to generate the table's 
Iceberg schema. A complex union type in the Avro schema is converted to an 
Iceberg struct with fields `(int tag, Type0 field0, Type1 field1, ..)`, where 
each `Type_i` corresponds to a union branch from the input Avro schema, in the 
order the branches appear in that schema. At read time, the Iceberg schema is 
used to read the Avro files with union types, but there is no guarantee that 
the union branches in a file appear in the same order as the fields of the 
Iceberg struct. Name mapping is therefore required to connect each union 
branch in the file to its field ID in the Iceberg struct.
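
   As a rough illustration of the struct layout described above, the following 
sketch lists the fields a union would be converted to. The class and method 
names are hypothetical and type names are plain strings; this is not the 
conversion code in Iceberg.

```java
import java.util.ArrayList;
import java.util.List;

class UnionToStructSketch {
  // Builds the field list (int tag, Type0 field0, Type1 field1, ..) for a union,
  // keeping the branches in the order they appear in the source schema.
  static List<String> structFields(List<String> unionBranchTypes) {
    List<String> fields = new ArrayList<>();
    fields.add("tag: int");
    for (int i = 0; i < unionBranchTypes.size(); i++) {
      fields.add("field" + i + ": " + unionBranchTypes.get(i));
    }
    return fields;
  }

  public static void main(String[] args) {
    // Hypothetical union with three branches.
    System.out.println(structFields(List.of("int", "string", "Point")));
  }
}
```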
   
   The current name mapping mechanism builds the entire name mapping from the 
Iceberg schema as the sole source of truth. However, this does not extend to 
cases where we want to map Avro union types to Iceberg structs because:
   * Avro union type options do not have a field name to begin with.
   * Other sources of identification (e.g., union branch data type or branch 
record type in case of Avro named types) are not carried over into the Iceberg 
schema.
   For these reasons, a supporting Avro schema is required to derive the name 
mapping, unlike the field-name case, where the Iceberg schema alone is 
adequate.
   
   To map union types using both the source Avro schema and the derived Iceberg 
schema, the two schemas are traversed together: the field ID is taken from the 
Iceberg schema, and the identifying information is taken from the Avro schema:
   * If the union branch is a named type (e.g., `RECORD`, `ENUM`, `FIXED`), the 
type's name is used as the identifier.
   * Otherwise, the type's `toString()` value is used.
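
   The two rules above can be sketched roughly as follows. `Branch` is a 
hypothetical stand-in for an Avro union branch, the `typeString` values are 
placeholders rather than the exact text Avro's `toString()` produces, and 
skipping the leading `tag` field is an assumption of this sketch. It is not the 
code in `NameMappingWithAvroSchema`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UnionBranchKeysSketch {
  /** Hypothetical stand-in for one branch of an Avro union. */
  static final class Branch {
    final String type;        // e.g. "record", "enum", "fixed", "int", "array"
    final String name;        // set only for named types, otherwise null
    final String typeString;  // placeholder for the schema's toString() value

    Branch(String type, String name, String typeString) {
      this.type = type;
      this.name = name;
      this.typeString = typeString;
    }
  }

  // Named types are identified by name; every other type by its string form.
  static String key(Branch branch) {
    boolean named = List.of("record", "enum", "fixed").contains(branch.type);
    return named ? branch.name : branch.typeString;
  }

  // Pairs branch i with struct field i + 1, since field 0 is assumed to be the tag.
  static Map<String, Integer> mapBranches(List<Branch> branches, List<Integer> structFieldIds) {
    if (structFieldIds.size() != branches.size() + 1) {
      throw new IllegalArgumentException("expected one tag field plus one field per branch");
    }
    Map<String, Integer> mapping = new LinkedHashMap<>();
    for (int i = 0; i < branches.size(); i++) {
      mapping.put(key(branches.get(i)), structFieldIds.get(i + 1));
    }
    return mapping;
  }

  // Hypothetical union [int, record Point, array of string] with made-up field IDs.
  static Map<String, Integer> example() {
    List<Branch> branches = List.of(
        new Branch("int", null, "int"),
        new Branch("record", "Point", "record Point"),
        new Branch("array", null, "array<string>"));
    return mapBranches(branches, List.of(10, 11, 12, 13));
  }

  public static void main(String[] args) {
    System.out.println(example());
  }
}
```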
   
   This identifying information is adequate because, per the Avro 
specification, a union cannot contain more than one map or array type, or two 
branches of the same primitive type. Further, the names of named types are 
unique within a union.
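
   To make the uniqueness argument concrete, here is a minimal sketch of a 
check that the identifiers derived for one union are pairwise distinct. The 
class and method names are hypothetical, and the PR is not known to include 
such a check.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueBranchKeysSketch {
  // Returns true when no identifier occurs twice within a single union.
  static boolean allDistinct(List<String> branchKeys) {
    Set<String> seen = new HashSet<>();
    for (String key : branchKeys) {
      if (!seen.add(key)) {
        return false; // Set.add returns false for a key that was already present
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Keys of a hypothetical valid union: two primitives and one named record.
    System.out.println(allDistinct(List.of("int", "string", "Point")));
  }
}
```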
   
   In non-union cases (e.g., regular nested fields), the standard field name is 
used as the name mapping key.
   
   Since the Avro schema in this scenario is the source schema from which the 
Iceberg schema is derived, `AvroSchemaWithTypeVisitor` cannot be reused as the 
abstract visitor class, because it assumes the Avro schema is already 
annotated with the IDs. Hence, this PR introduces a new abstract visitor class, 
`AvroSchemaWithDerivedTypeVisitor`, which assumes that the Avro schema is the 
source schema and the Iceberg schema is the derived schema. The name mapping 
logic is defined in `NameMappingWithAvroSchema`. The union-specific 
implementation that reflects the identification method above is in 
`NameMappingWithAvroSchema#union`.
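
   The idea of walking a source schema and a derived schema side by side can be 
sketched with toy node types. `Source` and `Derived` below are hypothetical 
stand-ins for the Avro and Iceberg schemas, the dotted-path output is an 
assumption made for readability, and none of this reflects the actual 
signatures of `AvroSchemaWithDerivedTypeVisitor`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PairedTraversalSketch {
  /** Toy source-schema node (stands in for an Avro schema). */
  static final class Source {
    final String kind;            // "record", "union", or "primitive"
    final String key;             // field name, or union-branch identifier
    final List<Source> children;  // record fields or union branches

    Source(String kind, String key, List<Source> children) {
      this.kind = kind;
      this.key = key;
      this.children = children;
    }
  }

  /** Toy derived-schema node (stands in for an Iceberg field carrying an ID). */
  static final class Derived {
    final int id;
    final List<Derived> children;

    Derived(int id, List<Derived> children) {
      this.id = id;
      this.children = children;
    }
  }

  // Walks both trees together; keys come from the source, IDs from the derived tree.
  static void visit(Source source, Derived derived, String path, Map<String, Integer> out) {
    // A union's derived struct starts with the tag field, so branches shift by one.
    int offset = source.kind.equals("union") ? 1 : 0;
    for (int i = 0; i < source.children.size(); i++) {
      Source child = source.children.get(i);
      Derived derivedChild = derived.children.get(i + offset);
      String childPath = path.isEmpty() ? child.key : path + "." + child.key;
      out.put(childPath, derivedChild.id);
      visit(child, derivedChild, childPath, out);
    }
  }

  // Hypothetical record { id: int, value: union [int, record Point { x: int }] }.
  static Map<String, Integer> example() {
    Source point = new Source("record", "Point", List.of(new Source("primitive", "x", List.of())));
    Source value = new Source("union", "value", List.of(new Source("primitive", "int", List.of()), point));
    Source root = new Source("record", "root", List.of(new Source("primitive", "id", List.of()), value));

    Derived pointStruct = new Derived(5, List.of(new Derived(6, List.of())));
    Derived valueStruct = new Derived(2,
        List.of(new Derived(3, List.of()), new Derived(4, List.of()), pointStruct));
    Derived derivedRoot = new Derived(-1, List.of(new Derived(1, List.of()), valueStruct));

    Map<String, Integer> out = new LinkedHashMap<>();
    visit(root, derivedRoot, "", out);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(example());
  }
}
```

   In this toy example the union branches are keyed by their identifiers 
(`value.int`, `value.Point`), while regular fields are keyed by field name.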
   
   Testing:
   Added unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

