Hi everyone,

It turns out that since parquet-mr 1.12.0 we write wrong values, under certain conditions, into ColumnMetaData.dictionary_page_offset <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L753> and ColumnChunk.file_offset <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790>. See the details in PARQUET-2078 <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>. Because of that, any implementation that uses these values has to be prepared for potentially invalid values if the file was written by parquet-mr 1.12.0.
As per my understanding of the issue (to be verified), distinguishing between valid and invalid values of these offsets is quite easy: the value is invalid if dictionary_page_offset is set while the column chunk is not dictionary encoded (as per ColumnMetaData.encodings <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L725>). In this case we have to use the offset of the first data page <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747> in the first column chunk of the row group.

Regards,
Gabor
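
For illustration, the check described above could be sketched roughly as follows. This is only a minimal Python sketch, not parquet-mr code: ColumnMeta is a hypothetical stand-in for the Thrift ColumnMetaData struct, the helper names are made up, and treating PLAIN_DICTIONARY and RLE_DICTIONARY as the dictionary encodings is an assumption that is subject to the same "to be verified" caveat.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Assumption: these two encodings mark a column chunk as dictionary encoded.
DICTIONARY_ENCODINGS = frozenset({"PLAIN_DICTIONARY", "RLE_DICTIONARY"})


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical stand-in for the relevant ColumnMetaData fields."""
    encodings: FrozenSet[str]
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


def is_dictionary_encoded(meta: ColumnMeta) -> bool:
    # True if any encoding listed for the chunk is a dictionary encoding.
    return bool(meta.encodings & DICTIONARY_ENCODINGS)


def has_suspect_dictionary_offset(meta: ColumnMeta) -> bool:
    # The offset is set although the chunk is not dictionary encoded.
    return meta.dictionary_page_offset is not None and not is_dictionary_encoded(meta)


def first_page_offset(meta: ColumnMeta) -> int:
    # Trust dictionary_page_offset only for dictionary-encoded chunks;
    # otherwise fall back to the offset of the first data page.
    if meta.dictionary_page_offset is not None and is_dictionary_encoded(meta):
        return meta.dictionary_page_offset
    return meta.data_page_offset


# Example with made-up offsets.
good = ColumnMeta(frozenset({"PLAIN", "RLE", "RLE_DICTIONARY"}), 120, 4)
bad = ColumnMeta(frozenset({"PLAIN", "RLE"}), 120, 4)
print(first_page_offset(good), first_page_offset(bad))
```

A reader would apply such a check per column chunk before seeking, instead of using dictionary_page_offset unconditionally.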
