See this past mailing list thread
https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E
and associated PR
https://github.com/apache/arrow/pull/4815

There hasn't been a lot of movement on this, primarily because all the key
people who've expressed interest in it have been really busy with other
matters (myself included).

Having RLE encoding in memory at minimum would be a huge benefit for a number
of applications, so it would be great to continue the discussion and create a
more comprehensive proposal document describing what we would like to
implement (and what we do not want to implement).

On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <[email protected]> wrote:
>
> Hey Evan,
>
>
> thank you for the interest.
>
> There has been some effort for compressing floating-point data on the Parquet
> side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress
> floating-point data but makes it more compressible when a compressor,
> such as ZSTD, LZ4, etc., is used. It only works well for high-entropy
> floating-point data, with at least 15 bits of entropy per
> element. I suppose the encoding might actually also make sense for
> high-entropy integer data but I am not super sure.
> For low-entropy data, the dictionary encoding is good though I suspect there
> can be room for performance improvements.
> This is my final report for the encoding here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>
> Note that at some point my investigation turned out to be quite the same
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>
>
> Maybe the points I sent can be helpful.
>
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: [email protected] <[email protected]> on behalf of Evan Chan
> <[email protected]>
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: [email protected]
> Subject: Summary of RLE and other compression efforts?
>
> Hi folks,
>
> I’m curious about the state of efforts for more compressed encodings in the
> Arrow columnar format. I saw discussions previously about RLE, but is there
> a place to summarize all of the different efforts that are ongoing to bring
> more compressed encodings?
>
> Is there an effort to compress floating point or integer data using
> techniques such as XOR compression and Delta-Delta? I can contribute to some
> of these efforts as well.
>
> Thanks,
> Evan
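
As a rough illustration of the BYTE_STREAM_SPLIT idea described in the quoted
message, here is a minimal Python sketch. It assumes little-endian float32
values and uses only the standard library. The function names
byte_stream_split and byte_stream_merge are hypothetical, and this is not the
Parquet or Arrow implementation. The transform only regroups bytes so that
byte k of every value is stored contiguously; it does not shrink the data by
itself.

```python
import struct

WIDTH = 4  # bytes per float32 value (assumption: single precision only)


def byte_stream_split(values):
    """Regroup float32 bytes into WIDTH streams: all byte 0s, all byte 1s, ..."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # raw[k::WIDTH] picks byte k of every value; concatenate the streams.
    return b"".join(raw[k::WIDTH] for k in range(WIDTH))


def byte_stream_merge(data):
    """Inverse of byte_stream_split: interleave the streams back into values."""
    n = len(data) // WIDTH
    streams = [data[k * n:(k + 1) * n] for k in range(WIDTH)]
    # zip(*streams) yields the WIDTH bytes belonging to each original value.
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))


values = [1.0, 1.5, -2.25, 0.0]
encoded = byte_stream_split(values)
decoded = byte_stream_merge(encoded)
```

The encoded buffer has the same length as the input; any size reduction would
come from handing it to a general-purpose compressor such as ZSTD or LZ4
afterwards.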
