Dear all,

There was some earlier discussion about adding a new encoding for better 
compression of FP32 and FP64 data.


The pull request which extends the format is here: 
https://github.com/apache/parquet-format/pull/144
The change already has one approval from Zoltan.


The results of an investigation into compression ratio and speed with the new 
encoding versus other encodings are available here: 
https://github.com/martinradev/arrow-fp-compression-bench
In many tests the new encoding achieves a better compression ratio, and in 
some cases better speed. The speed improvements come from the fact that the 
new encoding can lead to faster parsing for some compressors, such as GZIP.
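
For readers who have not opened the pull request, here is a minimal sketch of 
the idea behind BYTE_STREAM_SPLIT: byte k of every value goes into stream k, 
and the streams are concatenated before the compressor runs. This is an 
illustration only, not the Arrow or parquet-mr implementation. The function 
names are hypothetical, and little-endian IEEE 754 values are assumed.

```python
import struct


def byte_stream_split_encode(values, fmt="f"):
    """Scatter byte k of every value into stream k, then join the streams.

    fmt is a struct format character: "f" for FP32, "d" for FP64.
    (Hypothetical helper, for illustration only.)
    """
    width = struct.calcsize("<" + fmt)  # 4 for FP32, 8 for FP64
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # raw[k::width] selects byte k of each value in order.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(encoded, fmt="f"):
    """Inverse of byte_stream_split_encode: interleave the streams again."""
    width = struct.calcsize("<" + fmt)
    count = len(encoded) // width
    raw = bytearray(len(encoded))
    for k in range(width):
        # Stream k holds `count` bytes; put them back at positions k, k+width, ...
        raw[k::width] = encoded[k * count:(k + 1) * count]
    return list(struct.unpack("<%d%s" % (count, fmt), bytes(raw)))
```

The encoded buffer has the same size as the input; any gain comes from the 
compressor that runs afterwards (for example zlib.compress on the encoded 
bytes), since similar bytes such as sign and exponent end up next to each 
other.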


An earlier report which examines other FP compressors (fpzip, spdp, fpc, zfp, 
sz) and new potential encodings is available here: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
The report also covers lossy compression, but the BYTE_STREAM_SPLIT encoding 
focuses only on lossless compression.


Can we have a vote?


Regards,

Martin
