Hey Evan,

Thank you for your interest. There has been some effort toward compressing floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating-point data, but it makes the data more compressible when a compressor such as ZSTD or LZ4 is applied afterwards. It only works well for high-entropy floating-point data, roughly >= 15 bits of entropy per element. I suppose the encoding might also make sense for high-entropy integer data, but I am not sure. For low-entropy data, the dictionary encoding is good, though I suspect there is room for performance improvements.

My final report for the encoding is here:
https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

Note that at some point my investigation converged on essentially the same solution as the one in https://github.com/powturbo/Turbo-Transpose.

Maybe the points I sent can be helpful.

Kind regards,
Martin

________________________________
From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan <evan_c...@apple.com.INVALID>
Sent: Tuesday, March 10, 2020 5:15:48 AM
To: dev@arrow.apache.org
Subject: Summary of RLE and other compression efforts?

Hi folks,

I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format. I saw discussions previously about RLE, but is there a place that summarizes all of the different ongoing efforts to bring more compressed encodings?

Is there an effort to compress floating-point or integer data using techniques such as XOR compression and Delta-Delta? I can contribute to some of these efforts as well.

Thanks,
Evan
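
For readers following the thread: below is a minimal Python/NumPy sketch of the byte-stream-split idea Martin describes above. It is not the Parquet implementation, and the function names (`byte_stream_split`, `byte_stream_merge`) are illustrative; the sketch only shows the transposition step, assuming a general-purpose compressor such as ZSTD or zlib is applied to the output afterwards.

```python
import numpy as np

def byte_stream_split(values: np.ndarray) -> bytes:
    """Transpose an array so the k-th byte of each element forms stream k.

    This is the core idea behind BYTE_STREAM_SPLIT: the transposition
    itself does not shrink the data, but bytes within one stream (e.g.
    the exponent bytes of float32 values) are often similar, so a
    downstream compressor tends to compress the result better.
    """
    raw = values.view(np.uint8).reshape(-1, values.itemsize)
    # Column k of `raw` holds the k-th byte of every element; lay the
    # columns out contiguously, one stream after another.
    return np.ascontiguousarray(raw.T).tobytes()

def byte_stream_merge(data: bytes, dtype=np.float32) -> np.ndarray:
    """Invert byte_stream_split for elements of the given dtype."""
    itemsize = np.dtype(dtype).itemsize
    streams = np.frombuffer(data, np.uint8).reshape(itemsize, -1)
    return np.ascontiguousarray(streams.T).view(dtype).ravel()

if __name__ == "__main__":
    import zlib
    # A smooth random walk: high-entropy mantissas, similar exponents.
    x = np.cumsum(np.random.default_rng(0).normal(size=100_000)).astype(np.float32)
    plain = zlib.compress(x.tobytes(), 6)
    split = zlib.compress(byte_stream_split(x), 6)
    assert np.array_equal(byte_stream_merge(byte_stream_split(x)), x)
    print(f"plain: {len(plain)} bytes, split: {len(split)} bytes")
```

In an actual Parquet file the encoding is applied per data page and the choice of downstream codec is orthogonal; the linked report and Turbo-Transpose cover SIMD-friendly ways to do the transposition much faster than this sketch.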
thank you for the interest. There has been some effort for compressing floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating point data but makes it more compressible for when a compressor, such as ZSTD, LZ4, etc, is used. It only works well for high-entropy floating-point data, somewhere at least as large as >= 15 bits of entropy per element. I suppose the encoding might actually also make sense for high-entropy integer data but I am not super sure. For low-entropy data, the dictionary encoding is good though I suspect there can be room for performance improvements. This is my final report for the encoding here: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf Note that at some point my investigation turned out be quite the same solution as the one in https://github.com/powturbo/Turbo-Transpose. Maybe the points I sent can be helpful. Kinds regards, Martin ________________________________ From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan <evan_c...@apple.com.INVALID> Sent: Tuesday, March 10, 2020 5:15:48 AM To: dev@arrow.apache.org Subject: Summary of RLE and other compression efforts? Hi folks, I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format. I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings? Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta? I can contribute to some of these efforts as well. Thanks, Evan