xglv1985 opened a new issue #2812: URL: https://github.com/apache/hudi/issues/2812
When I ran a Spark 2.4 job to incrementally query a MOR table, the job failed with the following errors:

```
2021-04-13 12:39:54 [Executor task launch worker for task 215] ERROR [Executor:91]: Exception in task 160.2 in stage 0.0 (TID 215)
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: null
    at org.apache.parquet.format.Util.read(Util.java:216)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:936)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:217)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException
    at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at shaded.parquet.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readBinary(TCompactProtocol.java:709)
    at org.apache.parquet.format.InterningProtocol.readBinary(InterningProtocol.java:223)
    at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:102)
    at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:138)
    at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:130)
    at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1047)
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
    at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
    at org.apache.parquet.format.Util.read(Util.java:213)
    ... 25 more
```

and

```
2021-04-13 12:40:07 [Executor task launch worker for task 1113] ERROR [Executor:91]: Exception in task 206.4 in stage 0.0 (TID 1113)
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@ec35f50
    at org.apache.parquet.format.Util.read(Util.java:216)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:936)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:217)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@ec35f50
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1055)
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
    at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
    at org.apache.parquet.format.Util.read(Util.java:213)
```

My Scala code is as follows:
```scala
import org.apache.spark.sql.SparkSession
// Hudi read option constants (QUERY_TYPE_OPT_KEY, etc.) come from DataSourceReadOptions
import org.apache.hudi.DataSourceReadOptions._

def main(args: Array[String]) {
  val spark = SparkSession.builder().getOrCreate()
  val conf = spark.conf
  val input = conf.get(Constants.SPARK_INPUT)
  val sql = conf.get(Constants.SPARK_SQL_QUERY)

  val df = spark.read.format("hudi").
    option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
    option(BEGIN_INSTANTTIME_OPT_KEY, "20210413081500").
    load(input)

  df.createOrReplaceTempView("hudi_materials_incremental")
  spark.sql(sql).show()
  df.printSchema()
}
```

To rule out a problem with the Parquet files themselves, I ran an experiment: I changed `spark.read.format("hudi")` to `spark.read.format("parquet")`, removed the next two lines with the Hudi options, and re-ran the job. It completed with no errors at all. So what actually caused the error, and what should I change in my code or configuration to eliminate it? Thanks!

**Environment Description**

* Hudi version : 0.8.0
* Spark version : 2.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
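For reference, a minimal sketch of the plain-Parquet control experiment described above (it assumes the same `spark`, `input`, and `sql` values as in the snippet, and that `input` points at the table's base path):

```scala
// Control experiment: read the same path directly as Parquet, bypassing Hudi
// and the incremental query options. In my test this ran without any errors.
val plainDf = spark.read.format("parquet").load(input)
plainDf.createOrReplaceTempView("hudi_materials_incremental")
spark.sql(sql).show()
plainDf.printSchema()
```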