Any Parquet implementations might be impacted by PARQUET-2078

Gabor Szadovszky Fri, 27 Aug 2021 02:11:58 -0700

Hi everyone,

It turned out that, since parquet-mr 1.12.0, under certain conditions we
write wrong values into ColumnMetaData.dictionary_page_offset
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L753>
and
ColumnChunk.file_offset
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790>.
See details in PARQUET-2078
<https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>.
Because of that, any implementation that uses these values has to be
prepared for potentially invalid values if the file was written by
parquet-mr 1.12.0.


As per my understanding of the issue (to be verified), distinguishing
between valid and invalid values of these offsets is quite easy: the value
is invalid if dictionary_page_offset is set while the column chunk is
not dictionary encoded (as per ColumnMetaData.encodings
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L725>).
In this case we have to use the offset of the first data page
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747>
in the first column chunk of the row group.
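
A minimal sketch of the check described above, in Python. It is not taken
from parquet-mr; the function name and the plain-value parameters are
hypothetical, and it assumes the encodings are available as the names used
in the Encoding enum of parquet.thrift (PLAIN_DICTIONARY and RLE_DICTIONARY
being the dictionary encodings). Like the rule itself, it is still to be
verified against the details in PARQUET-2078.

```python
from typing import Iterable, Optional

# Encoding names from parquet.thrift that imply a dictionary page exists.
DICTIONARY_ENCODINGS = frozenset({"PLAIN_DICTIONARY", "RLE_DICTIONARY"})


def chunk_start_offset(
    encodings: Iterable[str],
    dictionary_page_offset: Optional[int],
    data_page_offset: int,
) -> int:
    """Return the offset a reader should treat as the start of a column chunk.

    Hypothetical helper: a dictionary_page_offset is trusted only when the
    chunk's encodings say it is dictionary encoded. Otherwise the value is
    considered invalid and the first data page offset is used.
    """
    has_dictionary = any(e in DICTIONARY_ENCODINGS for e in encodings)
    if has_dictionary and dictionary_page_offset is not None:
        return dictionary_page_offset
    # Offset is either unset or set on a chunk without a dictionary
    # (the suspicious case), so fall back to the data page.
    return data_page_offset


if __name__ == "__main__":
    # Offset set, but the chunk is not dictionary encoded: ignore it.
    print(chunk_start_offset(["PLAIN", "RLE"], 4, 100))
    # Dictionary-encoded chunk: the dictionary page offset is used.
    print(chunk_start_offset(["PLAIN_DICTIONARY", "RLE"], 4, 100))
```

A reader could apply such a check to every column chunk, or only when the
file's created_by string reports an affected parquet-mr version.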

Regards,
Gabor
