Thank you, Wes. If the stars line up, I’d be interested in joining and 
contributing to this effort. I have a ton of ideas around efficient encodings 
for different types of data.

> On Mar 10, 2020, at 2:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> See this past mailing list thread
> 
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E
> 
> and associated PR
> 
> https://github.com/apache/arrow/pull/4815
> 
> There hasn't been a lot of movement on this but primarily because all
> the key people who've expressed interest in it have been really busy
> with other matters (myself included). Having RLE encoding in memory at
> minimum would be a huge benefit for a number of applications, so it
> would be great to continue the discussion and create a more
> comprehensive proposal document describing what we would like to
> implement (and what we do not want to implement).
> 
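
For readers new to the idea, run-length encoding as discussed above can be sketched as follows. This is only an illustration in plain Python, not the layout proposed in the linked PR; the function names and the (value, run_length) pair representation are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


column = ["a", "a", "a", "b", "b", "c"]
runs = rle_encode(column)
restored = rle_decode(runs)
```

The benefit grows with run length: a column with long runs of repeated values shrinks to a handful of pairs, while a column with no repeats gets slightly larger.
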
> On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <martin.ra...@tum.de> wrote:
>> 
>> Hey Evan,
>> 
>> 
>> Thank you for the interest.
>> 
>> There has been some effort on compressing floating-point data on the 
>> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not 
>> compress floating-point data, but it makes the data more compressible when 
>> a compressor such as ZSTD or LZ4 is applied afterwards. It only works well 
>> for high-entropy floating-point data, roughly 15 or more bits of entropy 
>> per element. I suppose the encoding might also make sense for high-entropy 
>> integer data, but I am not sure.
>> For low-entropy data, dictionary encoding works well, though I suspect 
>> there is room for performance improvements.
>> My final report on the encoding is here: 
>> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>> 
>> Note that at some point my investigation turned out to be much the same 
>> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>> 
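
The byte-splitting step described above can be sketched roughly as follows: byte k of every value goes into stream k, and the streams are concatenated. This is a minimal pure-Python illustration, not the Parquet or Arrow implementation; the function names and the little-endian float32 default are assumptions made for the example.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter the bytes of each value into per-position streams."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream k holds byte k of every value; streams are stored back to back.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Interleave the streams again and unpack the original values."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    streams = [data[k * count:(k + 1) * count] for k in range(width)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(count)]


values = [1.0, 2.0, 3.5]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

The output has the same size as the input. Any gain comes from the general-purpose compressor run afterwards, which tends to find more redundancy once sign and exponent bytes sit next to each other.
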
>> 
>> Maybe the points I sent can be helpful.
>> 
>> 
>> Kind regards,
>> 
>> Martin
>> 
>> ________________________________
>> From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan 
>> <evan_c...@apple.com.INVALID>
>> Sent: Tuesday, March 10, 2020 5:15:48 AM
>> To: dev@arrow.apache.org
>> Subject: Summary of RLE and other compression efforts?
>> 
>> Hi folks,
>> 
>> I’m curious about the state of efforts for more compressed encodings in the 
>> Arrow columnar format.  I saw discussions previously about RLE, but is there 
>> a place to summarize all of the different efforts that are ongoing to bring 
>> more compressed encodings?
>> 
>> Is there an effort to compress floating point or integer data using 
>> techniques such as XOR compression and Delta-Delta?  I can contribute to 
>> some of these efforts as well.
>> 
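
As a rough illustration of the two techniques named above, the sketch below shows only the transform stage: delta-of-delta for integers and XOR with the previous value for float64 bit patterns. A real codec would follow this with a bit-packing step that stores the many resulting zeros compactly; that step is omitted here. All function names are hypothetical and nothing here reflects an existing Arrow API.

```python
import struct


def delta_delta_encode(values):
    """Store the change in the delta between consecutive integers."""
    out, prev, prev_delta = [], 0, 0
    for v in values:
        delta = v - prev
        out.append(delta - prev_delta)
        prev, prev_delta = v, delta
    return out


def delta_delta_decode(encoded):
    """Rebuild the integers by accumulating deltas of deltas."""
    out, prev, prev_delta = [], 0, 0
    for dd in encoded:
        prev_delta += dd
        prev += prev_delta
        out.append(prev)
    return out


def float_bits(x):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def xor_encode(values):
    """XOR each float64 bit pattern with the previous one."""
    out, prev = [], 0
    for v in values:
        bits = float_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out


def xor_decode(encoded):
    """Undo xor_encode by XOR-accumulating the bit patterns."""
    out, prev = [], 0
    for x in encoded:
        prev ^= x
        out.append(struct.unpack("<d", struct.pack("<Q", prev))[0])
    return out


timestamps = [100, 110, 120, 130, 141]
dd = delta_delta_encode(timestamps)
xs = xor_encode([1.5, 1.5, 1.5])
```

Regularly spaced timestamps turn into runs of zeros under delta-of-delta, and slowly changing floats turn into XOR results with long runs of zero bits, which is what makes the later bit-packing effective.
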
>> Thanks,
>> Evan
>> 
>> 
