Mani, please send a reply to dev-unsubscr...@arrow.apache.org to remove yourself from the list.
On Thu, Jul 11, 2019 at 11:10 AM mani vannan <mani.van...@analyticswise.com> wrote:
> All,
>
> Can someone please help me unsubscribe from this group?
>
> Thank you.
>
> -----Original Message-----
> From: Radev, Martin <martin.ra...@tum.de>
> Sent: Thursday, July 11, 2019 2:08 PM
> To: dev@arrow.apache.org; emkornfi...@gmail.com
> Cc: Raoofy, Amir <amir.rao...@tum.de>; Karlstetter, Roman <roman.karlstet...@tum.de>
> Subject: Re: Adding a new encoding for FP data
>
> Hello Micah,
>
> the changes will go to the C++ implementation of Parquet within Arrow.
> In that sense, if Arrow uses the compression and encoding methods
> available in Parquet in any way, I expect a benefit.
>
> My plan is to add the new encoding to parquet-cpp and parquet-mr (Java).
>
> If you have any more questions or concerns, let me know.
> I am close to done with my patch.
>
> Regards,
> Martin
>
> ________________________________
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Thursday, July 11, 2019 5:26:26 PM
> To: dev@arrow.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Adding a new encoding for FP data
>
> Hi Martin,
> Can you clarify whether you were expecting the encoding to be used only
> in Parquet, or more generally in Arrow?
>
> Thanks,
> Micah
>
> On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > If you could participate in Micah's discussion about compression and
> > encoding generally at
> >
> > https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
> >
> > it would be helpful. I personally think that Arrow would benefit from
> > an alternate protocol message type to the current RecordBatch (as
> > defined in Message.fbs) that allows for encoded or compressed columns.
> > This won't be an overnight change (more on the order of months of
> > work), but it's worth taking the time to carefully consider the
> > implications of developing and supporting such a feature for the
> > long term.
> >
> > On Thu, Jul 11, 2019 at 5:34 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >
> > > Hi Radev,
> > >
> > > Thanks a lot for providing so many technical details. I need to read
> > > them carefully.
> > >
> > > I think FP encoding is definitely a useful feature.
> > > I hope this feature can be implemented in Arrow soon, so that we can
> > > use it in our system.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin <martin.ra...@tum.de> wrote:
> > > >
> > > > Hello Liya Fan,
> > > >
> > > > this post explains the technique, but for a more complex case:
> > > > https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
> > > >
> > > > For FP data, the approach which seemed to work best is the following.
> > > >
> > > > Say we have a buffer of two 32-bit floating point values:
> > > >
> > > > buf = [af, bf]
> > > >
> > > > We interpret each FP value as a 32-bit uint and look at each
> > > > individual byte. We have 8 bytes in total for this small input.
> > > >
> > > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > > >
> > > > Then we apply stream splitting and the new buffer becomes:
> > > >
> > > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > > >
> > > > We compress newbuf.
> > > > Due to similarities in the sign bits, mantissa bits and MSB
> > > > exponent bits, we might have a lot more repetition in the data.
> > > > For scientific data, the 2nd and 3rd byte of a 32-bit value is
> > > > probably largely noise. Thus in the original representation we
> > > > would always have a few bytes of data which could appear somewhere
> > > > else in the buffer, followed by a couple of bytes of possible
> > > > noise. In the new representation we have a long stream of data
> > > > which could compress well, followed by a sequence of noise towards
> > > > the end.
> > > >
> > > > This transformation improved the compression ratio, as can be seen
> > > > in the report.
> > > >
> > > > It also improved speed for ZSTD. This could be because ZSTD makes
> > > > a decision on how to compress the data - RLE, a new Huffman tree,
> > > > the Huffman tree of the previous frame, or the raw representation.
> > > > Each can potentially achieve a different compression ratio and
> > > > compression/decompression speed. It turned out that when the
> > > > transformation is applied, zstd would attempt to compress fewer
> > > > frames and copy the others. This could lead to fewer attempts to
> > > > build a Huffman tree. It's hard to pinpoint the exact reason.
> > > >
> > > > I did not try other lossless text compressors, but I expect
> > > > similar results.
> > > >
> > > > For code, I can polish my patches, create a Jira task and submit
> > > > the patches for review.
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Fan Liya <liya.fa...@gmail.com>
> > > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > > To: dev@arrow.apache.org
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Adding a new encoding for FP data
> > > >
> > > > Hi Radev,
> > > >
> > > > Thanks for the information. It seems interesting.
> > > > IMO, Arrow has much to do for data compression. However, there
> > > > seem to be some differences between in-memory data compression and
> > > > external storage data compression.
> > > >
> > > > Could you please provide some references for stream splitting?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.ra...@tum.de> wrote:
> > > > >
> > > > > Hello people,
> > > > >
> > > > > there has been discussion on the Apache Parquet mailing list
> > > > > about adding a new encoder for FP data.
> > > > > The reason for this is that the compressors supported by Apache
> > > > > Parquet (zstd, gzip, etc.) do not compress raw FP data well.
> > > > >
> > > > > In my investigation it turns out that a very simple technique,
> > > > > named stream splitting, can improve the compression ratio and
> > > > > even the speed for some of the compressors.
> > > > >
> > > > > You can read about the results here:
> > > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > >
> > > > > I went through the developer guide for Apache Arrow and wrote a
> > > > > patch to add the new encoding and test coverage for it.
> > > > >
> > > > > I will polish my patch and work in parallel to extend the Apache
> > > > > Parquet format for the new encoding.
> > > > >
> > > > > If you have any concerns, please let me know.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin