I came across a case where a job writes out a dataset in Parquet format that cannot be read back: the file appears to be corrupted.
Files fail to read back once their size exceeds 2GB. If I set the job to produce more, smaller files from exactly the same input, all is good. The job writes Avro messages to Parquet via `parquet-avro` and `parquet-mr`. It happens with both v1.10.1 and v1.12.0. The read error is:

    java.io.EOFException: Cannot seek to negative offset
        at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1454)
        at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream.seek(H2SeekableInputStream.java:60)
        at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
        at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)

Digging a bit into the read path, the code materializing `ColumnChunkMetaData` [1] starts to see negative values for `firstDataPage`. Printing some info from `reader.getRowGroups` yields:

    startingPos=4, totalBytesSize=519551822, rowCount=2300100
    startingPos=108156606, totalBytesSize=517597985, rowCount=2300100
    ...
    startingPos=1950017569, totalBytesSize=511705703, rowCount=2300100
    startingPos=2058233752, totalBytesSize=521762439, rowCount=2300100
    startingPos=-2128348908, totalBytesSize=508570588, rowCount=2300100
    startingPos=-2020294298, totalBytesSize=518901187, rowCount=2300100
    startingPos=-1911848035, totalBytesSize=512724804, rowCount=2300100
    startingPos=-1803573306, totalBytesSize=510980877, rowCount=2300100
    startingPos=-1695543557, totalBytesSize=525871692, rowCount=2300100
    startingPos=-1587016600, totalBytesSize=519353830, rowCount=2300100
    startingPos=-1478696427, totalBytesSize=451032173, rowCount=2090372

Unfortunately, I was not able to reproduce this locally by taking the Avro schema, generating random inputs, and writing them out to a local file. Every time, compressed or uncompressed, a 3GB file read back correctly.
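One observation: the negative values begin exactly where startingPos would first cross 2^31 (the last positive value, 2058233752, plus one more ~108MB row group), so they look consistent with a 64-bit offset being truncated to a signed 32-bit int somewhere. A quick self-contained sanity check (the values are copied from the output above; the arithmetic is my own, not parquet-mr code):

```java
public class OffsetOverflowCheck {
    public static void main(String[] args) {
        // Negative startingPos values reported by reader.getRowGroups above.
        long[] reported = {-2128348908L, -2020294298L, -1911848035L};
        long prev = 2058233752L; // last positive startingPos in the output
        for (long pos : reported) {
            // Undo a signed 32-bit wrap-around by adding 2^32.
            long recovered = pos + (1L << 32);
            // Each recovered offset resumes the ~108MB-per-row-group spacing
            // seen in the positive entries, e.g. 108156606 - 4.
            System.out.printf("reported=%d recovered=%d delta=%d%n",
                              pos, recovered, recovered - prev);
            prev = recovered;
        }
    }
}
```

Adding 2^32 back to each negative startingPos yields offsets that continue the ~108MB progression of the earlier row groups, which is what a 32-bit truncation past the 2GiB mark would produce.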
I am looking for help in finding a solution, or hints for debugging this, as I am out of clues for pinpointing and reproducing the problem. Thanks!

[1] https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L127