[
https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217634#comment-17217634
]
Wes McKinney commented on ARROW-10353:
--------------------------------------
Note that DataPageV2 is not recommended for production use
> [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0
> --------------------------------------------------------------------
>
> Key: ARROW-10353
> URL: https://issues.apache.org/jira/browse/ARROW-10353
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Jan Finis
> Priority: Major
> Fix For: 3.0.0
>
>
> According to the parquet-format specification, DataPageV2 pages have an
> is_compressed flag. Even if the column chunk has a decompression codec set,
> the page is only compressed if this flag is true (this likely enables not
> compressing some pages where the compression wouldn't save memory).
> Here is the relevant excerpt from parquet.thrift describing the semantics of
> the is_compressed flag in a DataPageV2:
> _/** whether the values are compressed._
> _Which means the section of the page between_
> _definition_levels_byte_length + repetition_levels_byte_length + 1 and
> compressed_page_size (included)_
> _is compressed with the compression_codec._
> _If missing it is considered compressed */_
> _7: optional bool is_compressed = 1;_
>
> It seems that the apache parquet cpp library (haven't checked other languages
> but might have the bug as well) totally disregards this flag and decompresses
> the page in all cases if a decompressor is set for the column chunk.
> The erroneous code is in column_reader.cc:
> std::shared_ptr<Page> SerializedPageReader::NextPage()
> This method first decompresses the page if there is a decompressor set and
> only then does a case distinction on whether this page is a DataPageV2 and
> has the is_compressed flag. Thus, even if the page would have this flag set
> to 0, the page would be decompressed anyway.
> The method that should use the is_compressed flag but doesn't is:
> std::shared_ptr<Buffer> SerializedPageReader::DecompressPage
> This method doesn't look at the is_compressed flag at all.
>
> The reason why this bug probably doesn't show in any unit test is that the
> write implementation seems to do the same mistake: It always compresses the
> page, even if the page has its is_compressed flag set to false.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)