yiqiangin commented on code in PR #5704:
URL: https://github.com/apache/iceberg/pull/5704#discussion_r990346173
##########
core/src/main/java/org/apache/iceberg/avro/AvroSchemaWithTypeVisitor.java:
##########
@@ -82,11 +83,45 @@ private static <T> T visitRecord(
private static <T> T visitUnion(Type type, Schema union,
AvroSchemaWithTypeVisitor<T> visitor) {
List<Schema> types = union.getTypes();
List<T> options = Lists.newArrayListWithExpectedSize(types.size());
- for (Schema branch : types) {
- if (branch.getType() == Schema.Type.NULL) {
- options.add(visit((Type) null, branch, visitor));
- } else {
- options.add(visit(type, branch, visitor));
+
+ // simple union case
+ if (AvroSchemaUtil.isOptionSchema(union)) {
+ for (Schema branch : types) {
+ if (branch.getType() == Schema.Type.NULL) {
+ options.add(visit((Type) null, branch, visitor));
+ } else {
+ options.add(visit(type, branch, visitor));
+ }
+ }
+ } else { // complex union case
+ Preconditions.checkArgument(
+ type instanceof Types.StructType,
+ "Cannot visit invalid Iceberg type: %s for Avro complex union type:
%s",
+ type,
+ union);
+ Map<String, Integer> fieldNameToId =
+ (Map)
union.getObjectProp(SchemaToType.AVRO_FIELD_NAME_TO_ICEBERG_ID);
+ for (Schema branch : types) {
+ if (branch.getType() == Schema.Type.NULL) {
+ options.add(visit((Type) null, branch, visitor));
+ } else {
+ String name =
+ branch.getType().equals(Schema.Type.RECORD)
+ ? branch.getName()
+ : branch.getType().getName();
+ if (fieldNameToId.containsKey(name)) {
+ int fieldId = fieldNameToId.get(name);
+ Types.NestedField branchType = type.asStructType().field(fieldId);
+ if (branchType != null) {
+ options.add(visit(branchType.type(), branch, visitor));
+ } else {
+ Type pseudoBranchType = AvroSchemaUtil.convert(branch);
+ options.add(visit(pseudoBranchType, branch, visitor));
+ }
+ } else {
+ options.add(visit((Type) null, branch, visitor));
+ }
+ }
}
}
return visitor.union(type, union, options);
Review Comment:
I am afraid not. The readers for all branch types need to pass into
`UnionReader`, as `UnionReader` needs to read all types of data from the union
in Avro file to read the all the records successfully. Filtering the data based
on the types projected in expected Iceberg schema can only happens after the
data are read from Avro file
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]