szehon-ho opened a new pull request #2395: URL: https://github.com/apache/iceberg/pull/2395
* Quick fix for the problem reported in https://github.com/apache/iceberg/issues/1378
* As Russell mentioned, he debugged the same issue in https://github.com/apache/iceberg/pull/1744, which attempts a more complete fix. This PR focuses on fixing the 'entries' and 'all-entries' tables.
* Background: when a Spark aggregation query runs on the "entries" metadata table, an empty projection is passed in.
* However, data_file is a required field per the manifest schema spec, so this projection triggers java.lang.IllegalArgumentException: Missing required field: data_file in BuildAvroProjection.record
* https://github.com/apache/iceberg/pull/1077 fixes it only for non-partitioned tables
* That fix works only because of a peculiarity in PruneColumns: empty structs are not pruned away. For a non-partitioned table, data_file.partition is an empty struct, so data_file is kept in the final projection. For a partitioned table, data_file.partition is a non-empty struct with no fields matching the projection, so it is pruned away and data_file is dropped from the final projection along with it.
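To make the pruning asymmetry above concrete, here is a minimal toy model in Java. It is a sketch under stated assumptions, not Iceberg's actual PruneColumns code: the `PruneSketch` class, its `Field` type, the `prune` and `dataFile` helpers, and the field IDs and names other than data_file are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model of the pruning asymmetry; NOT Iceberg's PruneColumns implementation. */
public class PruneSketch {

  /** A field is a primitive (children == null) or a struct (children != null). */
  public static final class Field {
    public final int id;
    public final String name;
    public final List<Field> children;

    Field(int id, String name, List<Field> children) {
      this.id = id;
      this.name = name;
      this.children = children;
    }

    boolean isStruct() {
      return children != null;
    }
  }

  public static Field primitive(int id, String name) {
    return new Field(id, name, null);
  }

  public static Field struct(int id, String name, Field... children) {
    return new Field(id, name, Arrays.asList(children));
  }

  /** Returns the pruned field, or null if the field is dropped from the projection. */
  public static Field prune(Field field, Set<Integer> selectedIds) {
    if (selectedIds.contains(field.id)) {
      return field; // explicitly selected: keep as-is
    }
    if (!field.isStruct()) {
      return null; // unselected primitive: drop
    }
    if (field.children.isEmpty()) {
      return field; // assumed quirk: an empty struct is never pruned away
    }
    List<Field> kept = new ArrayList<>();
    for (Field child : field.children) {
      Field pruned = prune(child, selectedIds);
      if (pruned != null) {
        kept.add(pruned);
      }
    }
    // A non-empty struct with no surviving children is dropped entirely.
    return kept.isEmpty() ? null : new Field(field.id, field.name, kept);
  }

  /** Builds a simplified data_file struct; IDs and the "day" field are illustrative. */
  public static Field dataFile(boolean partitioned) {
    Field partition = partitioned
        ? struct(102, "partition", primitive(1000, "day"))
        : struct(102, "partition");
    return struct(2, "data_file", primitive(100, "file_path"), partition);
  }

  public static void main(String[] args) {
    Set<Integer> emptyProjection = Collections.emptySet();
    System.out.println("non-partitioned keeps data_file: "
        + (prune(dataFile(false), emptyProjection) != null));
    System.out.println("partitioned keeps data_file: "
        + (prune(dataFile(true), emptyProjection) != null));
  }
}
```

In this model, an empty projection keeps data_file for the non-partitioned layout (only because its empty partition struct survives) and drops it for the partitioned layout, which matches the case where the required-field check fails.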
Full exception stack for reference:

Caused by: java.lang.IllegalArgumentException: Missing required field: data_file
    at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:217)
    at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:98)
    at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:42)
    at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
    at org.apache.iceberg.avro.AvroSchemaUtil.buildAvroProjection(AvroSchemaUtil.java:105)
    at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:68)
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:132)
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:106)
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:98)
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:66)
    at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
    at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
    at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
    at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
    at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
    at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
    at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
    at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
    at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
    at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:95)
    at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:86)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_1$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:897)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:897)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
