[
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney resolved PARQUET-1716.
-----------------------------------
Fix Version/s: cpp-1.6.0
Resolution: Fixed
Issue resolved by pull request 6005
[https://github.com/apache/arrow/pull/6005]
> [C++] Add support for BYTE_STREAM_SPLIT encoding
> ------------------------------------------------
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Original Estimate: 72h
> Time Spent: 14h
> Remaining Estimate: 58h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622
> ):*
> Apache Parquet does not have any encodings suitable for FP data, and the
> available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream
> splitting". One such transform is "byte stream splitting", which creates K streams of
> length N, where K is the number of bytes in the data type (4 for floats, 8 for
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the
> original data and for some cases there is a performance improvement in
> compression and decompression speed.
> You can read a more detailed report here:
> [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced requirements for storing FP
> parquet column data and improvements in decompression speed.*
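The transform described above can be sketched as follows. This is a hypothetical illustration of the encoding idea, not the actual parquet-cpp implementation from PR 6005; the function names `ByteStreamSplit` and `ByteStreamMerge` are assumptions for the example. For N floats (K = 4 bytes each), stream k receives byte k of every value, so bytes with similar statistical behavior (e.g. exponent bytes) end up contiguous and compress better.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Forward transform: split N floats into K = 4 byte streams of length N.
// Stream b occupies out[b * n .. b * n + n - 1] and holds byte b of each value.
std::vector<uint8_t> ByteStreamSplit(const std::vector<float>& values) {
  const size_t n = values.size();
  const size_t k = sizeof(float);  // 4 streams for float, 8 for double
  std::vector<uint8_t> out(n * k);
  for (size_t i = 0; i < n; ++i) {
    uint8_t bytes[sizeof(float)];
    std::memcpy(bytes, &values[i], sizeof(float));  // avoid aliasing issues
    for (size_t b = 0; b < k; ++b) {
      out[b * n + i] = bytes[b];  // byte b of element i goes to stream b
    }
  }
  return out;
}

// Inverse transform: reassemble the original floats from the byte streams.
std::vector<float> ByteStreamMerge(const std::vector<uint8_t>& streams,
                                   size_t n) {
  const size_t k = sizeof(float);
  std::vector<float> out(n);
  for (size_t i = 0; i < n; ++i) {
    uint8_t bytes[sizeof(float)];
    for (size_t b = 0; b < k; ++b) {
      bytes[b] = streams[b * n + i];
    }
    std::memcpy(&out[i], bytes, sizeof(float));
  }
  return out;
}
```

The transform itself is lossless and byte-for-byte reversible; the compression benefit comes only when a general-purpose compressor is applied to the split output afterward.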
--
This message was sent by Atlassian Jira
(v8.3.4#803005)