It turned out that ColumnMetaData.dictionary_page_offset is not impacted by this issue, so it is much easier to handle. It seems that 1.12.0 is the first parquet-mr release that writes ColumnChunk.file_offset, and according to PARQUET-2078 <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527> the value is invalid in certain cases. So every implementation needs a way to calculate/gather this offset without using the actual field. What we need to ensure is that no one relies on the value of ColumnChunk.file_offset, at least when the file was written by parquet-mr 1.12.0.
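
A minimal sketch of how a reader might derive the start of a column chunk without touching ColumnChunk.file_offset. It assumes that a dictionary page, when present, is written before the first data page, so the chunk starts at dictionary_page_offset if that field is set, positive, and smaller than data_page_offset, and at data_page_offset otherwise. The Python classes below are hypothetical stand-ins for the Thrift structs, not an actual Parquet API, and column_chunk_start is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    # Hypothetical stand-in for the Thrift struct; only the offsets used here.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    # Hypothetical stand-in; file_offset is kept only to show it is ignored.
    file_offset: int
    meta_data: ColumnMetaData


def column_chunk_start(chunk: ColumnChunk) -> int:
    """Return the offset of the chunk's first page without reading file_offset."""
    md = chunk.meta_data
    dict_offset = md.dictionary_page_offset
    # Assumption: a dictionary page, when present, precedes the first data page.
    if dict_offset is not None and 0 < dict_offset < md.data_page_offset:
        return dict_offset
    return md.data_page_offset
```

The function never reads chunk.file_offset, so a wrong value written by parquet-mr 1.12.0 cannot affect the result.
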
I've also created PARQUET-2080 <https://issues.apache.org/jira/browse/PARQUET-2080> to deprecate the field in the format.

Regards,
Gabor

On Fri, Aug 27, 2021 at 11:11 AM Gabor Szadovszky <[email protected]> wrote:

> Hi everyone,
>
> It turned out that since parquet-mr 1.12.0 in certain conditions we write
> wrong values into ColumnMetaData.dictionary_page_offset
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L753>
> and
> ColumnChunk.file_offset
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790>.
> See details in PARQUET-2078
> <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>.
> Because of that any implementations that use these values have to be
> prepared for potential invalid values in case the file is written by
> parquet-mr 1.12.0.
>
> As per my understanding of the issue (to be verified),
> distinguishing between valid and invalid values of these offsets is quite
> easy: dictionary_page_offset is set to a value while the column chunk is
> not dictionary encoded (as per ColumnMetaData.encodings
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L725>).
> In this case we have to use the offset of the first data page
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747>
> in the first column chunk of the row group.
>
> Regards,
> Gabor
>
