Thank you, Wes. If the stars line up, I’d be interested in joining and contributing to this effort. I have a ton of ideas around efficient encodings for different types of data.
> On Mar 10, 2020, at 2:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> See this past mailing list thread
>
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E
>
> and associated PR
>
> https://github.com/apache/arrow/pull/4815
>
> There hasn't been a lot of movement on this, primarily because all
> the key people who've expressed interest in it have been really busy
> with other matters (myself included). Having RLE encoding in memory at
> minimum would be a huge benefit for a number of applications, so it
> would be great to continue the discussion and create a more
> comprehensive proposal document describing what we would like to
> implement (and what we do not want to implement)
>
> On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <martin.ra...@tum.de> wrote:
>>
>> Hey Evan,
>>
>>
>> thank you for the interest.
>>
>> There has been some effort for compressing floating-point data on the
>> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
>> compress floating-point data but makes it more compressible when a
>> compressor, such as ZSTD, LZ4, etc., is used. It only works well for
>> high-entropy floating-point data, with at least roughly 15 bits
>> of entropy per element. I suppose the encoding might actually also make
>> sense for high-entropy integer data but I am not super sure.
>> For low-entropy data, the dictionary encoding is good, though I suspect there
>> can be room for performance improvements.
>> This is my final report for the encoding here:
>> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>>
>> Note that at some point my investigation turned out to be much the same
>> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>>
>>
>> Maybe the points I sent can be helpful.
>>
>>
>> Kind regards,
>>
>> Martin
>>
>> ________________________________
>> From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan
>> <evan_c...@apple.com.INVALID>
>> Sent: Tuesday, March 10, 2020 5:15:48 AM
>> To: dev@arrow.apache.org
>> Subject: Summary of RLE and other compression efforts?
>>
>> Hi folks,
>>
>> I’m curious about the state of efforts for more compressed encodings in the
>> Arrow columnar format. I saw discussions previously about RLE, but is there
>> a place to summarize all of the different efforts that are ongoing to bring
>> more compressed encodings?
>>
>> Is there an effort to compress floating point or integer data using
>> techniques such as XOR compression and Delta-Delta? I can contribute to
>> some of these efforts as well.
>>
>> Thanks,
>> Evan
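
For readers following the BYTE_STREAM_SPLIT discussion in the quoted thread, here is a minimal sketch of the idea in Python. It assumes little-endian float32 values; the function names are hypothetical and this is not the Parquet or Arrow implementation. It only illustrates how byte k of every value is gathered into stream k, which by itself saves no space but can give a general-purpose compressor more regular input.

```python
import struct


def byte_stream_split_encode(values, width=4):
    """Hypothetical helper: pack values as little-endian float32, then
    gather byte k of every value into stream k and concatenate the streams."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # raw[k::width] selects byte k of each width-byte element.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(encoded, width=4):
    """Hypothetical helper: inverse of byte_stream_split_encode."""
    n = len(encoded) // width  # number of encoded values
    out = bytearray(len(encoded))
    for k in range(width):
        # Scatter stream k back into byte position k of each element.
        out[k::width] = encoded[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}f", bytes(out)))
```

The encoded buffer has the same length as the input, so any size reduction would come only from running a compressor such as ZSTD or LZ4 over the result, as described in the thread.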