I came across a case where a job writes out a data set in Parquet format
and it cannot be read back, as it appears to be corrupted.

Files fail to read back once their size goes over 2GB. If I set the job
to produce more, smaller files from exactly the same input, all is good.

The job writes Avro messages to Parquet via `parquet-avro` and `parquet-mr`.
It happens with both v1.10.1 and v1.12.0.
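For context, the write side is the usual Avro-to-Parquet path. A minimal
sketch of that kind of writer is below; the schema, output path and codec
are placeholders, not the job's actual settings:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteSketch {
  // Writes a batch of Avro records to a single Parquet file.
  // Schema, output path and codec are illustrative, not the job's real config.
  static void write(Schema schema, Iterable<GenericRecord> records, Path out)
      throws Exception {
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(out)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      for (GenericRecord record : records) {
        writer.write(record);
      }
    }
  }
}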

The read error is:

Cannot seek to negative offset
java.io.EOFException: Cannot seek to negative offset
    at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1454)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
    at org.apache.parquet.hadoop.util.H2SeekableInputStream.seek(H2SeekableInputStream.java:60)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)

Digging a bit into the read path, the code materializing
`ColumnChunkMetaData` [1] starts to see negative values for
`firstDataPage`. Printing some info from `reader.getRowGroups()` yields
the listing below (a sketch of the printing code follows it):


startingPos=4, totalBytesSize=519551822, rowCount=2300100
startingPos=108156606, totalBytesSize=517597985, rowCount=2300100
...
startingPos=1950017569, totalBytesSize=511705703, rowCount=2300100
startingPos=2058233752, totalBytesSize=521762439, rowCount=2300100
startingPos=-2128348908, totalBytesSize=508570588, rowCount=2300100
startingPos=-2020294298, totalBytesSize=518901187, rowCount=2300100
startingPos=-1911848035, totalBytesSize=512724804, rowCount=2300100
startingPos=-1803573306, totalBytesSize=510980877, rowCount=2300100
startingPos=-1695543557, totalBytesSize=525871692, rowCount=2300100
startingPos=-1587016600, totalBytesSize=519353830, rowCount=2300100
startingPos=-1478696427, totalBytesSize=451032173, rowCount=2090372

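The printout comes from something along these lines (a sketch, not the
exact code; the file path and Configuration are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DumpRowGroups {
  public static void main(String[] args) throws Exception {
    // Path to the problematic file; replace with the real location.
    Path path = new Path(args[0]);
    Configuration conf = new Configuration();

    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      // Print the footer's view of each row group, matching the fields above.
      for (BlockMetaData block : reader.getRowGroups()) {
        System.out.println(
            "startingPos=" + block.getStartingPos()
                + ", totalBytesSize=" + block.getTotalByteSize()
                + ", rowCount=" + block.getRowCount());
      }
    }
  }
}
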
Unfortunately, I was not able to reproduce it locally by taking the Avro
schema, generating random inputs, and writing them out to a local file.
Every time, compressed or uncompressed, a 3GB file read back correctly.
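One detail in the listing above that may or may not be a hint: 2GB is the
signed 32-bit boundary, and the first negative startingPos (-2128348908)
is exactly what a real offset of about 2166618388 would become if it were
narrowed to an int somewhere. That offset value is my own extrapolation
from the ~108MB spacing of the earlier row groups, not something I have
confirmed. A tiny illustration of the wraparound:

public class OffsetWrap {
  public static void main(String[] args) {
    // Hypothetical true offset of the first failing row group (just past 2 GiB),
    // extrapolated from the ~108MB spacing of the preceding row groups.
    long realOffset = 2166618388L;

    // Narrowing to a signed 32-bit int wraps past Integer.MAX_VALUE...
    int wrapped = (int) realOffset;

    // ...and prints -2128348908, matching the first negative startingPos above.
    System.out.println(wrapped);
  }
}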

I am looking for help in finding a solution, or hints for debugging this,
as I am out of clues for pinpointing and reproducing the problem.

Thanks!

[1]
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L127
