It turned out that ColumnMetaData.dictionary_page_offset is not impacted by
this issue, so it is much easier to handle. It seems that 1.12.0 is the
first parquet-mr release that writes ColumnChunk.file_offset, and according
to PARQUET-2078
<https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>
the value is invalid in certain cases. So every implementation needs a way
to calculate this offset without using the field itself. What we need to
ensure is that no one relies on the value of ColumnChunk.file_offset, at
least when the file was written by parquet-mr 1.12.0.
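
To illustrate what "without using the actual field" could look like, here is
a minimal sketch in Python. It is only an illustration under stated
assumptions, not the parquet-mr implementation: the ColumnMetaData class
below is a hypothetical stand-in for the Thrift struct that carries just the
two offsets used here, and chunk_start_offset is a made-up helper name. It
relies on the dictionary page, when present, being written before the first
data page of the column chunk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    """Hypothetical stand-in for the Thrift ColumnMetaData struct.

    Only the two offsets needed for this sketch are modelled.
    """
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


def chunk_start_offset(meta: ColumnMetaData) -> int:
    """Derive where a column chunk starts without ColumnChunk.file_offset.

    Assumption: a dictionary page, if there is one, precedes the first
    data page, so a usable dictionary_page_offset is positive and smaller
    than data_page_offset. Otherwise fall back to data_page_offset.
    """
    dict_offset = meta.dictionary_page_offset
    if dict_offset is not None and 0 < dict_offset < meta.data_page_offset:
        return dict_offset
    return meta.data_page_offset
```

For example, a chunk with data_page_offset=100 and dictionary_page_offset=4
yields 4, while a chunk with no dictionary page (offset unset or 0) yields
100.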

I've also created PARQUET-2080
<https://issues.apache.org/jira/browse/PARQUET-2080> to deprecate the field
in the format.

Regards,
Gabor

On Fri, Aug 27, 2021 at 11:11 AM Gabor Szadovszky <[email protected]> wrote:

> Hi everyone,
>
> It turned out that since parquet-mr 1.12.0 in certain conditions we write
> wrong values into ColumnMetaData.dictionary_page_offset
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L753>
>  and
> ColumnChunk.file_offset
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790>.
> See details in PARQUET-2078
> <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>.
> Because of that, any implementation that uses these values has to be
> prepared for potentially invalid values if the file was written by
> parquet-mr 1.12.0.
>
> As per my understanding of the issue (to be verified),
> distinguishing between valid and invalid values of these offsets is quite
> easy: the value is invalid if dictionary_page_offset is set while the
> column chunk is not dictionary encoded (as per ColumnMetaData.encodings
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L725>).
> In this case we have to use the offset of the first data page
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747>
> in the first column chunk of the row group.
>
> Regards,
> Gabor
>
