shardulm94 commented on a change in pull request #4242:
URL: https://github.com/apache/iceberg/pull/4242#discussion_r829345463
##########
File path: core/src/main/java/org/apache/iceberg/avro/PruneColumns.java
##########
@@ -323,4 +325,19 @@ private static Schema makeEmptyCopy(Schema field) {
private static boolean isOptionSchemaWithNonNullFirstOption(Schema schema) {
return AvroSchemaUtil.isOptionSchema(schema) &&
schema.getTypes().get(0).getType() != Schema.Type.NULL;
}
+
+ // for primitive types, the visitResult will be null, we want to reuse the primitive types from the original
+ // schema, while for nested types, we want to use the visitResult because they have content from the previous
+ // recursive calls.
+ private static Schema copyUnion(Schema record, List<Schema> visitResults) {
Review comment:
It does not look like we prune any columns in the complex union case. I tried
modifying `TestSparkAvroUnions` to test projecting a single field from the
complex union and it fails (error below). There are a couple of ways we can
handle column pruning here:
1. We implement column pruning for complex unions. It should be okay
to do this in a followup PR, but in that case I think we should throw a
better error message letting the user know that projecting fields within a
complex union is not supported for now.
2. We choose not to prune columns inside complex unions and instead let
the engine handle the pruning. E.g. you can see a comment in `map()` =>
`// right now, maps can't be selected without values`. The `ReadBuilder`s might
need to be updated to account for this change. We also need to make sure that
the schema advertised by Iceberg to the engine does not prune these fields
either. In Spark land, this would be the schema returned through
`Scan.readSchema()`.
```
index (1) must be less than size (1)
java.lang.IndexOutOfBoundsException: index (1) must be less than size (1)
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1355)
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1337)
at org.apache.iceberg.relocated.com.google.common.collect.SingletonImmutableList.get(SingletonImmutableList.java:44)
at org.apache.iceberg.avro.AvroSchemaWithTypeVisitor.visitUnion(AvroSchemaWithTypeVisitor.java:98)
at org.apache.iceberg.avro.AvroSchemaWithTypeVisitor.visit(AvroSchemaWithTypeVisitor.java:41)
at org.apache.iceberg.avro.AvroSchemaWithTypeVisitor.visitRecord(AvroSchemaWithTypeVisitor.java:71)
at org.apache.iceberg.avro.AvroSchemaWithTypeVisitor.visit(AvroSchemaWithTypeVisitor.java:38)
at org.apache.iceberg.avro.AvroSchemaWithTypeVisitor.visit(AvroSchemaWithTypeVisitor.java:32)
at org.apache.iceberg.spark.data.SparkAvroReader.<init>(SparkAvroReader.java:57)
at org.apache.iceberg.spark.data.SparkAvroReader.<init>(SparkAvroReader.java:50)
at org.apache.iceberg.avro.Avro$ReadBuilder.lambda$build$1(Avro.java:652)
at org.apache.iceberg.avro.ProjectionDatumReader.newDatumReader(ProjectionDatumReader.java:79)
at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:69)
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:133)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:130)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:122)
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:66)
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:37)
at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:133)
at org.apache.iceberg.spark.data.TestSparkAvroUnions.writeAndValidateRequiredComplexUnion(TestSparkAvroUnions.java:75)
```
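To make option 1 concrete, here is a rough sketch of the kind of guard that could replace the `IndexOutOfBoundsException` with a clearer error. It is not code from this PR: `ComplexUnionGuard` and `checkUnionProjection` are hypothetical names, plain string lists stand in for the Avro union branches and the projected Iceberg fields, and it assumes the failure comes from the projected field count not matching the union branch count.

```java
import java.util.List;

// Hypothetical sketch, not part of Iceberg: fail fast with a descriptive
// message when a projection selects only some branches of a complex union.
class ComplexUnionGuard {

  // unionBranches: stand-in for the non-null branches of the Avro union.
  // projectedFields: stand-in for the fields requested by the projection.
  static void checkUnionProjection(List<String> unionBranches, List<String> projectedFields) {
    if (projectedFields.size() != unionBranches.size()) {
      throw new UnsupportedOperationException(String.format(
          "Projecting fields within a complex union is not supported: "
              + "union has %d branches but %d fields were projected",
          unionBranches.size(), projectedFields.size()));
    }
  }

  public static void main(String[] args) {
    List<String> branches = List.of("int", "string");

    // Selecting every branch passes the check.
    checkUnionProjection(branches, List.of("field0", "field1"));

    // Selecting a single branch is rejected with a readable message.
    try {
      checkUnionProjection(branches, List.of("field1"));
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In the real code the check would presumably sit wherever the union is visited against the expected type, so the user sees the unsupported-projection message instead of an index error from a relocated Guava list.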
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.