See this past mailing list thread https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E
and associated PR https://github.com/apache/arrow/pull/4815

There hasn't been a lot of movement on this, primarily because all the
key people who've expressed interest in it have been really busy with
other matters (myself included).

Having RLE encoding in memory at minimum would be a huge benefit for a
number of applications, so it would be great to continue the discussion
and create a more comprehensive proposal document describing what we
would like to implement (and what we do not want to implement).

On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <martin.ra...@tum.de> wrote:
>
> Hey Evan,
>
>
> thank you for the interest.
>
> There has been some effort for compressing floating-point data on the Parquet
> side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress
> floating-point data, but it makes the data more compressible when a
> compressor such as ZSTD or LZ4 is applied afterwards. It only works well for
> high-entropy floating-point data, roughly 15 or more bits of entropy per
> element. I suppose the encoding might also make sense for high-entropy
> integer data, but I am not sure.
> For low-entropy data, the dictionary encoding is good, though I suspect there
> is room for performance improvements.
> My final report for the encoding is here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>
> Note that at some point my investigation turned out to be much the same
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>
>
> Maybe the points I sent can be helpful.
>
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan
> <evan_c...@apple.com.INVALID>
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
>
> Hi folks,
>
> I’m curious about the state of efforts for more compressed encodings in the
> Arrow columnar format. I saw discussions previously about RLE, but is there
> a place to summarize all of the different efforts that are ongoing to bring
> more compressed encodings?
>
> Is there an effort to compress floating point or integer data using
> techniques such as XOR compression and Delta-Delta? I can contribute to some
> of these efforts as well.
>
> Thanks,
> Evan
>
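
For anyone reading along who has not seen BYTE_STREAM_SPLIT before, here is a
minimal sketch of the transform Martin describes: byte i of every value is
gathered into stream i, and the streams are concatenated. The output is the
same size as the input; any gain only comes from the compressor that runs
afterwards. The function names below are hypothetical (they are not Arrow or
Parquet APIs), the layout assumes little-endian float32 values, and zlib from
the Python standard library stands in for ZSTD or LZ4 purely so the example
has no dependencies.

```python
import struct
import zlib


def byte_stream_split_encode(values, fmt="f"):
    """Gather byte i of every value into stream i, then concatenate the streams."""
    width = struct.calcsize("<" + fmt)  # 4 for float32, 8 for float64 ("d")
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # raw[i::width] selects byte i of each packed value.
    return b"".join(raw[i::width] for i in range(width))


def byte_stream_split_decode(encoded, fmt="f"):
    """Inverse transform: interleave the byte streams back into whole values."""
    width = struct.calcsize("<" + fmt)
    count = len(encoded) // width
    raw = bytearray(len(encoded))
    for i in range(width):
        # Stream i occupies bytes [i * count, (i + 1) * count) of the input.
        raw[i::width] = encoded[i * count:(i + 1) * count]
    return list(struct.unpack("<%d%s" % (count, fmt), bytes(raw)))


def compressed_sizes(values, fmt="f"):
    """Return (plain, split) compressed sizes, using zlib as a stand-in codec."""
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    split = byte_stream_split_encode(values, fmt)
    return len(zlib.compress(raw)), len(zlib.compress(split))
```

Calling compressed_sizes() on a column of real measurements gives a rough
feel for whether the split helps on that data; as Martin notes, the benefit
depends heavily on the entropy of the values, so it should be measured
rather than assumed.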