yiqiangin opened a new pull request, #5704:
URL: https://github.com/apache/iceberg/pull/5704

   This PR consists of two parts:
   - support for non-optional union types, cherry-picked from the unmerged PR 
https://github.com/apache/iceberg/pull/4242
   - support for column projection in complex unions, which extends that PR
   
   In Iceberg, a complex union is represented by a struct with multiple 
fields. When no column projection prunes the schema, the number of fields 
equals the number of types in the union plus one (for the tag field). When a 
query projects columns, the Iceberg union schema is pruned and the struct 
contains only the projected subset of those fields.
   In contrast, the union in the Avro schema is not pruned under column 
projection, because the full union schema is needed to read the data from the 
Avro file successfully.
   The readers for union data in an Avro file are created from the type 
information in both the Avro schema and the Iceberg schema. The main problem to 
solve is therefore correlating each type in the Avro schema with the matching 
type in the Iceberg schema, especially when column projection leaves only some 
of the types in the Iceberg schema.
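
   The mismatch between the two schemas can be illustrated with a small 
language-neutral sketch (Python here, not the PR's Java code). The field names 
`tag`, `field0`, ... and the ids are hypothetical and chosen only for 
illustration; the description above only states that the struct has one field 
per union type plus a tag field.

```python
# Illustrative model only: field names and ids are assumptions, not Iceberg's API.

def union_to_struct(branch_types, first_id=1):
    """Represent a complex union as a struct: a tag field plus one field per branch."""
    fields = [{"id": first_id, "name": "tag", "type": "int"}]
    for i, branch_type in enumerate(branch_types):
        fields.append({"id": first_id + 1 + i, "name": f"field{i}", "type": branch_type})
    return fields


def prune(struct_fields, projected_ids):
    """Column projection keeps only the selected fields of the Iceberg struct."""
    return [f for f in struct_fields if f["id"] in projected_ids]


avro_union = ["int", "string", "float"]      # the Avro side always keeps every branch
full = union_to_struct(avro_union)           # tag + 3 branch fields
pruned = prune(full, projected_ids={1, 3})   # query selects the tag and the string branch

print(len(full))                             # 4
print([f["name"] for f in pruned])           # ['tag', 'field1']
```

After pruning, the Iceberg struct has two fields while the Avro union still has 
three branches, so the two can no longer be matched by position.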
   
   The main idea of the solution is as follows:
   - Build a mapping from each type name in the Avro schema to the id of the 
corresponding field in the Iceberg schema.
   - When value readers are created, look up each Avro type by name in this 
mapping and use the stored id to find the corresponding field in the Iceberg 
schema.
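
   A minimal sketch of these two steps, again in Python rather than the PR's 
Java. The helpers `convert_union` and `find_field`, the field names, and the 
starting id are hypothetical; the sketch assumes the mapping is recorded while 
ids are assigned, before any pruning happens.

```python
# Illustrative model only: helper names and ids are assumptions.

def convert_union(avro_branch_names, next_id):
    """Assign an Iceberg field id to each Avro branch and record name -> id."""
    fields, name_to_id = [], {}
    for i, name in enumerate(avro_branch_names):
        field_id = next_id + i
        fields.append({"id": field_id, "name": f"field{i}", "type": name})
        name_to_id[name] = field_id
    return fields, name_to_id


def find_field(avro_branch_name, name_to_id, iceberg_fields):
    """Resolve an Avro branch to its Iceberg field, or None if it was pruned."""
    by_id = {f["id"]: f for f in iceberg_fields}
    return by_id.get(name_to_id[avro_branch_name])


fields, mapping = convert_union(["int", "string", "float"], next_id=2)
projected = [f for f in fields if f["id"] == 3]    # only the string branch survives

print(mapping)                                     # {'int': 2, 'string': 3, 'float': 4}
print(find_field("string", mapping, projected))    # {'id': 3, 'name': 'field1', 'type': 'string'}
print(find_field("float", mapping, projected))     # None
```

Because the lookup goes through the field id, it keeps working no matter which 
subset of the union's fields the projection retains.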
   
   The details of the implementation are as follows:
   - The mapping from field names in the Avro schema to field ids in the 
Iceberg schema is derived during the conversion from Avro schema to Iceberg 
schema, in the function AvroSchemaUtil.convertToDeriveNameMapping and the class 
SchemaToType.
   - The mapping for the direct child fields of an Avro schema field is stored 
as a property named AvroFieldNameToIcebergId on that Avro schema field, so it 
can be easily accessed when the Avro schema is traversed to generate the 
corresponding readers for the Avro data file.
   - For a union, the key of the mapping is the name of the branch in the 
union.
   - For a complex union, AvroSchemaWithTypeVisitor.visitUnion() first gets 
the mapping from the property on the Avro schema, then gets the field id in the 
Iceberg schema using the type name in the Avro schema, and finally uses the 
field id to look up the field type in the Iceberg schema:
      - if the corresponding field exists in the Iceberg schema, the field is 
used together with the Avro schema node to create the reader;
      - if no field with the given id exists in the Iceberg schema (which means 
the field is not projected), a pseudo branch type is created from the 
corresponding Avro schema node to facilitate the creation of the reader.
   - In the UnionReader class, the rows read from the Avro data file are 
filtered according to the fields that exist in the Iceberg schema.
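
   The branch resolution and row filtering described above might look roughly 
like the following sketch. It is a simplified Python model, not the actual 
visitUnion() or UnionReader code: the function names, the dictionary-based row 
layout, and the assumption that the tag field is always projected are all 
hypothetical.

```python
# Illustrative model only: names and row layout are assumptions.

def plan_branch_readers(avro_branches, name_to_id, projected_fields):
    """For every Avro branch, use the projected Iceberg field or a pseudo type."""
    by_id = {f["id"]: f for f in projected_fields}
    plan = []
    for name in avro_branches:
        field = by_id.get(name_to_id[name])
        if field is not None:
            plan.append({"branch": name, "type": field["type"], "projected": True})
        else:
            # Not projected: derive a placeholder type from the Avro node so the
            # branch can still be decoded, then dropped from the output row.
            plan.append({"branch": name, "type": name, "projected": False})
    return plan


def read_union(tag, value, plan):
    """Build the output row: the tag plus one slot per projected branch."""
    row = {"tag": tag}
    for index, entry in enumerate(plan):
        if entry["projected"]:
            row[entry["branch"]] = value if index == tag else None
    return row


plan = plan_branch_readers(
    ["int", "string", "float"],
    {"int": 2, "string": 3, "float": 4},
    [{"id": 3, "name": "field1", "type": "string"}],
)

print(read_union(1, "abc", plan))   # {'tag': 1, 'string': 'abc'}
print(read_union(0, 7, plan))       # {'tag': 0, 'string': None}
```

Every Avro branch still gets a reader plan entry, so any record can be decoded, 
but only branches present in the projected Iceberg schema appear in the result.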


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

