yiqiangin opened a new pull request, #5704: URL: https://github.com/apache/iceberg/pull/5704
This PR consists of two parts:

- support for non-optional union types, cherry-picked from the unmerged PR https://github.com/apache/iceberg/pull/4242
- support for column projection in complex unions, which extends the previous PR

In Iceberg, a complex union is represented by a struct with multiple fields. When a query's column projection does not prune the schema, the number of fields equals the number of types in the union plus one (for the tag field). When column projection does apply, the Iceberg union schema is pruned and the struct contains only the projected subset of fields. In contrast, the union in the Avro schema is not pruned under column projection, because the full union schema is needed to read the data from the Avro file successfully. In addition, the readers for union data in an Avro file are created from the types in both the Avro schema and the Iceberg schema.

The main problem to solve is correlating each type in the Avro schema with its type in the Iceberg schema, especially when column projection leaves only some of the types in the Iceberg schema.

The main idea of the solution is as follows:

- Build a mapping from the type name in the Avro schema to the id of the corresponding field in the Iceberg schema.
- When value readers are created, find the corresponding Iceberg field for each Avro type using the id stored in the mapping, whose key is the Avro type name.

The details of the implementation are as follows:

- The mapping from the field name in the Avro schema to the field id in the Iceberg schema is derived during the conversion from Avro schema to Iceberg schema, in the function AvroSchemaUtil.convertToDeriveNameMapping and the class SchemaToType.
- The mapping of the direct child fields of an Avro schema field is stored as a property named AvroFieldNameToIcebergId on that Avro schema field, so it can be easily accessed when the Avro schema is traversed to generate the corresponding readers for the Avro data file.
- For a union, the key of the mapping is the name of the branch in the union.
- For a complex union, AvroSchemaWithTypeVisitor.visitUnion() first gets the mapping from the property of the Avro schema, then gets the field id in the Iceberg schema using the type name in the Avro schema, and finally uses the field id to get the field type in the Iceberg schema:
  - if the corresponding field exists in the Iceberg schema, the field is used together with the Avro schema node to create the reader;
  - if the field for the given field id does not exist in the Iceberg schema (which means the field is not projected), a pseudo branch type is created from the corresponding Avro schema node to facilitate the creation of the reader.
- In the class UnionReader, the rows read from the Avro data file are filtered according to the fields that exist in the Iceberg schema.
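
To make the lookup concrete, here is a minimal sketch in plain Java that does not use the Avro or Iceberg libraries. It is not the code from this PR. Every name in it (UnionProjectionSketch, branchNameToId, resolveBranches, projectRow, the "pseudo:" prefix) is hypothetical. It also assumes that field ids are assigned sequentially and that the tag always occupies the first slot of the output row; the actual implementation may differ on both points.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UnionProjectionSketch {

  // Hypothetical stand-in for the AvroFieldNameToIcebergId property:
  // maps each Avro union branch name to an Iceberg field id.
  // Assumption: ids are handed out sequentially starting at firstFieldId.
  static Map<String, Integer> branchNameToId(List<String> avroBranches, int firstFieldId) {
    Map<String, Integer> mapping = new LinkedHashMap<>();
    int id = firstFieldId;
    for (String branch : avroBranches) {
      mapping.put(branch, id++);
    }
    return mapping;
  }

  // For every branch of the (unpruned) Avro union, look up the Iceberg field by id.
  // A branch that was pruned by column projection gets a placeholder ("pseudo") type
  // derived from the Avro branch, so that a reader can still be built for it.
  static List<String> resolveBranches(
      List<String> avroBranches,
      Map<String, Integer> nameToId,
      Map<Integer, String> projectedFields) {
    List<String> resolved = new ArrayList<>();
    for (String branch : avroBranches) {
      Integer id = nameToId.get(branch);
      String icebergType = (id == null) ? null : projectedFields.get(id);
      resolved.add(icebergType != null ? icebergType : "pseudo:" + branch);
    }
    return resolved;
  }

  // Build the output row for one union value: the tag first (an assumption of this
  // sketch), then one slot per projected branch. Only the slot of the branch that
  // was actually read holds the value; values of unprojected branches are dropped.
  static List<Object> projectRow(
      int tag,
      Object value,
      List<String> avroBranches,
      Map<String, Integer> nameToId,
      Map<Integer, String> projectedFields) {
    List<Object> row = new ArrayList<>();
    row.add(tag);
    for (int i = 0; i < avroBranches.size(); i++) {
      Integer id = nameToId.get(avroBranches.get(i));
      if (id != null && projectedFields.containsKey(id)) {
        row.add(i == tag ? value : null);
      }
    }
    return row;
  }

  public static void main(String[] args) {
    List<String> branches = List.of("int", "string", "double");
    Map<String, Integer> ids = branchNameToId(branches, 2);
    // Column projection kept only the int and double branches.
    Map<Integer, String> projected = Map.of(2, "int", 4, "double");

    System.out.println(resolveBranches(branches, ids, projected));
    System.out.println(projectRow(1, "abc", branches, ids, projected));
    System.out.println(projectRow(2, 3.5, branches, ids, projected));
  }
}
```

In this sketch the Avro side always iterates over all three branches, which mirrors the point that the Avro union cannot be pruned, while the Iceberg side decides which slots survive in the output row.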
