[
https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Finis updated ARROW-10353:
------------------------------
Description:
According to the parquet-format specification, DataPageV2 pages have an
is_compressed flag. Even if the column chunk has a decompression codec set, the
page is only compressed if this flag is true (this likely enables not
compressing some pages where the compression wouldn't save memory).
Here is the relevant excerpt from parquet.thrift describing the semantics of
the is_compressed flag in a DataPageV2:
_/** whether the values are compressed._
_Which means the section of the page between_
_definition_levels_byte_length + repetition_levels_byte_length + 1 and
compressed_page_size (included)_
_is compressed with the compression_codec._
_If missing it is considered compressed */_
_7: optional bool is_compressed = 1;_
It seems that the apache parquet cpp library (haven't checked other languages
but might have the bug as well) totally disregards this flag and decompresses
the page in all cases if a decompressor is set for the column chunk.
The erroneous code is in column_reader.cc:
std::shared_ptr<Page> SerializedPageReader::NextPage()
This method first decompresses the page if there is a decompressor set and only
then does a case distinction on whether this page is a DataPageV2 and has the
is_compressed flag. Thus, even if the page would have this flag set to 0, the
page would be decompressed anyway.
was:
According to the parquet-format specification, DataPageV2 pages have an
is_compressed flag. Even if the column chunk has a decompression codec set, the
page is only compressed if this flag is true (this likely enables not
compressing some pages where the compression wouldn't save memory).
Here is the relevant excerpt from parquet.thrift describing the semantics of
the is_compressed flag in a DataPageV2:
_/** whether the values are compressed._
_Which means the section of the page between_
_definition_levels_byte_length + repetition_levels_byte_length + 1 and
compressed_page_size (included)_
_is compressed with the compression_codec._
_If missing it is considered compressed */_
_7: optional bool is_compressed = 1;_
It seems that the apache parquet cpp library (haven't checked other languages
but might have the bug as well) totally disregard this flag and decompress the
page in all cases if a decompressor is set for the column chunk.
The erroneous code is in column_reader.cc:
std::shared_ptr<Page> SerializedPageReader::NextPage()
This method first decompresses the page if there is a decompressor set and only
then does a case distinction on whether this page is a DataPageV2 and has the
is_compressed flag. Thus, even if the page would have this flag set to 0, the
page would be decompressed anyway.
> Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0
> ------------------------------------------------------------------------
>
> Key: ARROW-10353
> URL: https://issues.apache.org/jira/browse/ARROW-10353
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Jan Finis
> Priority: Major
>
> According to the parquet-format specification, DataPageV2 pages have an
> is_compressed flag. Even if the column chunk has a decompression codec set,
> the page is only compressed if this flag is true (this likely enables not
> compressing some pages where the compression wouldn't save memory).
> Here is the relevant excerpt from parquet.thrift describing the semantics of
> the is_compressed flag in a DataPageV2:
> _/** whether the values are compressed._
> _Which means the section of the page between_
> _definition_levels_byte_length + repetition_levels_byte_length + 1 and
> compressed_page_size (included)_
> _is compressed with the compression_codec._
> _If missing it is considered compressed */_
> _7: optional bool is_compressed = 1;_
>
> It seems that the apache parquet cpp library (haven't checked other languages
> but might have the bug as well) totally disregards this flag and decompresses
> the page in all cases if a decompressor is set for the column chunk.
> The erroneous code is in column_reader.cc:
> std::shared_ptr<Page> SerializedPageReader::NextPage()
> This method first decompresses the page if there is a decompressor set and
> only then does a case distinction on whether this page is a DataPageV2 and
> has the is_compressed flag. Thus, even if the page would have this flag set
> to 0, the page would be decompressed anyway.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)