wardlican opened a new issue, #3084:
URL: https://github.com/apache/parquet-java/issues/3084
### Describe the bug, including details regarding any error messages,
version, and platform.
When using iceberg, we encountered a situation where a parquet file we wrote
could not be read. When reading, the following error message appeared. Judging
from the exception information, it is speculated that the parquet file is
damaged or has not been written properly and cannot be parsed. We have also
tried a variety of parsing tools but cannot parse it normally. However, the
footer of the file is normal and the schema information of the file can be
obtained, but the read data cannot be parsed. The DataPageHeader.parquet
version is 1.13.1. Is there any tool that can restore damaged files?
```
org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can
not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader:
Required field 'num_values' was not found in serialized data! Struct:
org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
at
org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
at
org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
at
org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
at
org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class
org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field
'num_values' was not found in serialized data! Struct:
org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
at
org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
at
org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
at
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
at
org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
... 23 more
```
### Component(s)
Thrift
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]