Thanks Gabor. The Spark community is in the process of releasing Spark
3.2.0 with Parquet 1.12. Any idea when a new release will be available with
the fix? We may need to hold off on the Spark release for that.

Chao

On Mon, Aug 30, 2021 at 6:31 AM Gabor Szadovszky <[email protected]> wrote:

> It turned out that ColumnMetaData.dictionary_page_offset is not impacted by
> this issue, so it is much easier to handle. It seems that 1.12.0 is the
> first parquet-mr release that writes ColumnChunk.file_offset, and according
> to PARQUET-2078
> <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>
> the value is invalid in certain cases. So every implementation needs a way
> to calculate or gather this offset without using the actual field. What we
> need to ensure is that no one relies on the value of
> ColumnChunk.file_offset, at least when the file was written by parquet-mr
> 1.12.0.
>
> I've also created PARQUET-2080
> <https://issues.apache.org/jira/browse/PARQUET-2080> to deprecate the
> field in the format.
>
> Regards,
> Gabor
>
> On Fri, Aug 27, 2021 at 11:11 AM Gabor Szadovszky <[email protected]>
> wrote:
>
> > Hi everyone,
> >
> > It turned out that since parquet-mr 1.12.0, under certain conditions, we
> > write wrong values into ColumnMetaData.dictionary_page_offset
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L753>
> > and ColumnChunk.file_offset
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790>.
> > See details in PARQUET-2078
> > <https://issues.apache.org/jira/browse/PARQUET-2078?focusedCommentId=17405527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17405527>.
> > Because of that, any implementation that uses these values has to be
> > prepared for potentially invalid values when the file was written by
> > parquet-mr 1.12.0.
> >
> > As per my understanding of the issue (to be verified), distinguishing
> > between valid and invalid values of these offsets is quite easy: an
> > invalid value shows up as dictionary_page_offset being set while the
> > column chunk is not dictionary encoded (as per ColumnMetaData.encodings
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L725>).
> > In this case we have to use the offset of the first data page
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747>
> > in the first column chunk of the row group.
> >
> > Regards,
> > Gabor
> >
>
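
The check described in the quoted messages above could be sketched roughly as
follows. This is only an illustration, not the parquet-mr fix: the
ColumnMetaData dataclass is a hypothetical stand-in for the Thrift-generated
class (it mirrors the field names in parquet.thrift), the helper names are
made up, and treating PLAIN_DICTIONARY and RLE_DICTIONARY as the markers of
"dictionary encoded" is an assumption. The point is that the start offset is
derived from ColumnMetaData alone, so ColumnChunk.file_offset is never read.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a chunk counts as dictionary encoded if its encodings list
# contains one of these two names from the parquet.thrift Encoding enum.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


@dataclass
class ColumnMetaData:
    """Hypothetical stand-in for the Thrift-generated ColumnMetaData."""
    encodings: List[str]
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


def has_valid_dictionary_offset(meta: ColumnMetaData) -> bool:
    """True only if the offset is set AND the chunk is dictionary encoded."""
    if meta.dictionary_page_offset is None:
        return False
    return any(e in DICTIONARY_ENCODINGS for e in meta.encodings)


def chunk_start_offset(meta: ColumnMetaData) -> int:
    """Start of a column chunk, computed without ColumnChunk.file_offset."""
    if has_valid_dictionary_offset(meta):
        return meta.dictionary_page_offset
    # Offset is unset, or set on a chunk that is not dictionary encoded:
    # fall back to the first data page.
    return meta.data_page_offset


def row_group_start_offset(columns: List[ColumnMetaData]) -> int:
    """Start of a row group: the start of its first column chunk."""
    return chunk_start_offset(columns[0])
```

With this sketch, a chunk whose encodings are only PLAIN and RLE but which
still carries a dictionary_page_offset falls back to data_page_offset, which
is the case described for files written by parquet-mr 1.12.0.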
