ASF GitHub Bot commented on PARQUET-1622:

gszadovszky commented on pull request #144: PARQUET-1622: Add BYTE_STREAM_SPLIT 
URL: https://github.com/apache/parquet-format/pull/144

> Adding an encoding for FP data
> ------------------------------
>                 Key: PARQUET-1622
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1622
>             Project: Parquet
>          Issue Type: Wish
>          Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>            Reporter: Martin Radev
>            Priority: Minor
>              Labels: features, pull-request-available
>   Original Estimate: 48h
>  Remaining Estimate: 48h
> Apache Parquet does not have any encodings suitable for floating-point (FP)
> data, and the available general-purpose compressors (zstd, gzip, etc.) do not
> handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates
> K streams of length N, where K is the number of bytes in the data type (4 for
> floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases compression and decompression are also
> faster.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
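
For illustration, the byte stream splitting described in the issue can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the code from the pull request: it assumes little-endian IEEE 754 values, assumes that stream k collects byte k of every element, and simply concatenates the K streams. The function names byte_stream_split and byte_stream_merge are hypothetical.

```python
import struct


def byte_stream_split(values, fmt="f"):
    """Split N values into K byte streams and concatenate the streams.

    fmt is a struct format character: "f" (4-byte float) or "d" (8-byte double).
    Assumption: little-endian layout, stream k holds byte k of each element.
    """
    width = struct.calcsize("<" + fmt)  # K: 4 for floats, 8 for doubles
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # raw[k::width] picks byte k of every element, i.e. one stream of length N
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_merge(data, fmt="f"):
    """Inverse of byte_stream_split: interleave the K streams back into values."""
    width = struct.calcsize("<" + fmt)
    n = len(data) // width  # N: number of elements
    raw = bytearray(len(data))
    for k in range(width):
        # Put stream k back at byte position k of every element
        raw[k::width] = data[k * n:(k + 1) * n]
    return list(struct.unpack("<%d%s" % (n, fmt), bytes(raw)))


values = [1.0, 2.0, 3.0]
encoded = byte_stream_split(values)
decoded = byte_stream_merge(encoded)
```

The transformation only reorders bytes, so the output has the same size as the input. Any size reduction would come from passing the result to a compressor such as zstd or gzip afterwards.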
