Jan Finis created ARROW-10353:
---------------------------------

             Summary: Arrow Parquet Cpp decompresses DataPageV2 pages even if 
is_compressed==0
                 Key: ARROW-10353
                 URL: https://issues.apache.org/jira/browse/ARROW-10353
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Jan Finis


According to the parquet-format specification, DataPageV2 pages have an 
is_compressed flag. Even if the column chunk has a decompression codec set, the 
page is only compressed if this flag is true (this likely enables not 
compressing some pages where the compression wouldn't save memory).

Here is the relevant excerpt from parquet.thrift describing the semantics of 
the is_compressed flag in a DataPageV2:

 _/** whether the values are compressed._
 _Which means the section of the page between_
 _definition_levels_byte_length + repetition_levels_byte_length + 1 and 
compressed_page_size (included)_
 _is compressed with the compression_codec._
 _If missing it is considered compressed */_
 _7: optional bool is_compressed = 1;_

 

It seems that the apache parquet cpp library (haven't checked other languages 
but might have the bug as well) totally disregard this flag and decompress the 
page in all cases if a decompressor is set for the column chunk.

The erroneous code is in column_reader.cc: 

std::shared_ptr<Page> SerializedPageReader::NextPage() 


This method first decompresses the page if there is a decompressor set and only 
then does a case distinction on whether this page is a DataPageV2 and has the 
is_compressed flag. Thus, even if the page would have this flag set to 0, the 
page would be decompressed anyway.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to