Hey Evan,

thank you for your interest.

There has been some effort toward compressing floating-point data on the Parquet
side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress
floating-point data; it only rearranges the bytes so that the data becomes more
compressible when a compressor such as ZSTD or LZ4 is applied afterwards. It
only works well for high-entropy floating-point data, roughly >= 15 bits of
entropy per element. I suspect the encoding might also make sense for
high-entropy integer data, but I am not sure. A rough sketch of the transform
is below.
For low-entropy data, dictionary encoding works well, though I suspect there is
room for performance improvements.
My final report on the encoding is here:
https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

Note that at some point my investigation converged on essentially the same
solution as the one in https://github.com/powturbo/Turbo-Transpose.


I hope the points above are helpful.

Kind regards,

Martin

________________________________
From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan 
<evan_c...@apple.com.INVALID>
Sent: Tuesday, March 10, 2020 5:15:48 AM
To: dev@arrow.apache.org
Subject: Summary of RLE and other compression efforts?

Hi folks,

I’m curious about the state of efforts for more compressed encodings in the 
Arrow columnar format.  I saw discussions previously about RLE, but is there a 
place to summarize all of the different efforts that are ongoing to bring more 
compressed encodings?

Is there an effort to compress floating point or integer data using techniques 
such as XOR compression and Delta-Delta?  I can contribute to some of these 
efforts as well.

Thanks,
Evan

