Thanks Gabor. The Spark community is in the process of releasing Spark 3.2.0 with Parquet 1.12. Any idea when a new release with the fix will be available? We may need to hold off the Spark release for that.
Chao

On Mon, Aug 30, 2021 at 6:31 AM Gabor Szadovszky <[email protected]> wrote:

> It turned out that ColumnMetaData.dictionary_page_offset is not impacted by
> this issue so it is much easier to handle. It seems that 1.12.0 is the
> first parquet-mr release which writes ColumnChunk.file_offset and according
> to PARQUET-2078
> <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>
> it is invalid in certain cases. So any implementations need to have a way to
> calculate/gather this offset without using the actual field. What we need
> to ensure is that no one relies on the value of ColumnChunk.file_offset at
> least in cases when the file was written by parquet-mr 1.12.0.
>
> I've also created PARQUET-2080
> <https://issues.apache.org/jira/browse/PARQUET-2080> to deprecate the field
> in the format.
>
> Regards,
> Gabor
>
> On Fri, Aug 27, 2021 at 11:11 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi everyone,
> >
> > It turned out that since parquet-mr 1.12.0 in certain conditions we write
> > wrong values into ColumnMetaData.dictionary_page_offset
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L753>
> > and ColumnChunk.file_offset
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790>.
> > See details in PARQUET-2078
> > <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>.
> > Because of that any implementations that use these values have to be
> > prepared for potential invalid values in case the file is written by
> > parquet-mr 1.12.0.
> >
> > As per my understanding of the issue (to be verified), distinguishing
> > between valid and invalid values of these offsets is quite easy:
> > dictionary_page_offset is set to a value while the column chunk is
> > not dictionary encoded (as per ColumnMetaData.encodings
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L725>).
> > In this case we have to use the offset of the first data page
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747>
> > in the first column chunk of the row group.
> >
> > Regards,
> > Gabor
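
A rough sketch of the reader-side workaround discussed above, in case it helps: derive a column chunk's starting offset from ColumnMetaData only and never read ColumnChunk.file_offset. The dataclass and the function name chunk_start_offset are hypothetical stand-ins for illustration, not parquet-mr or Spark APIs; only the field names and the two dictionary encoding names are taken from parquet.thrift. The heuristic itself is the one from the Aug 27 mail, which was marked "to be verified".

```python
from dataclasses import dataclass
from typing import List, Optional

# Encoding names from parquet.thrift that indicate a dictionary-encoded chunk.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


@dataclass
class ColumnMetaData:
    """Hypothetical stand-in for the Thrift struct (only the fields used here)."""
    encodings: List[str]
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


def chunk_start_offset(meta: ColumnMetaData) -> int:
    """Start of the column chunk, computed without ColumnChunk.file_offset.

    Assumption (from the thread, unverified): a dictionary_page_offset on a
    chunk that is not dictionary encoded is bogus and must be ignored.
    """
    is_dict_encoded = any(e in DICTIONARY_ENCODINGS for e in meta.encodings)
    if is_dict_encoded and meta.dictionary_page_offset is not None:
        return meta.dictionary_page_offset
    # Fall back to the first data page of the chunk.
    return meta.data_page_offset
```

For a row group, the starting offset would then be chunk_start_offset of its first column chunk, which matches the "first data page in the first column chunk of the row group" rule when no dictionary page is present.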
