Hi Martin,
Can you clarify whether you were expecting the encoding to be used only in
Parquet, or more generally in Arrow?

Thanks,
Micah

On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> If you could participate in Micah's discussion about compression and
> encoding generally at
>
>
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
>
> it would be helpful. I personally think that Arrow would benefit from
> an alternate protocol message type to the current RecordBatch (as
> defined in Message.fbs) that allows for encoded or compressed columns.
> This won't be an overnight change (more on the order of months of
> work), but it's worth taking the time to carefully consider the
> implications of developing and supporting such a feature for the long
> term.
>
> On Thu, Jul 11, 2019 at 5:34 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Radev,
> >
> > Thanks a lot for providing so many technical details. I need to read them
> > carefully.
> >
> > I think FP encoding is definitely a useful feature.
> > I hope this feature can be implemented in Arrow soon, so that we can use
> it
> > in our system.
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin <martin.ra...@tum.de>
> wrote:
> >
> > > Hello Liya Fan,
> > >
> > >
> > > This post explains the technique, though for a more complex case:
> > >
> > >
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
> > >
> > > For FP data, the approach that seemed best is the following.
> > >
> > > Say we have a buffer of two 32-bit floating point values:
> > >
> > > buf = [af, bf]
> > >
> > > We interpret each FP value as a 32-bit uint and look at each individual
> > > byte. We have 8 bytes in total for this small input.
> > >
> > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > >
> > > Then we apply stream splitting and the new buffer becomes:
> > >
> > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > >
> > > We compress newbuf.
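[Editor's note: not part of the original thread. A minimal NumPy sketch of the transformation described above; the function names are my own, not from Martin's patch.]

```python
import numpy as np

def split_streams(values: np.ndarray) -> bytes:
    """Byte-wise stream splitting for float32 data.

    Reinterprets each float as 4 bytes and groups byte k of every
    value together: [af0, bf0, af1, bf1, af2, bf2, af3, bf3] for
    the two-value example in the email (shown here column-major,
    i.e. all 0th bytes first once there are many values).
    """
    raw = values.astype(np.float32).view(np.uint8).reshape(-1, 4)
    # Transposing groups equal byte positions into contiguous streams.
    return raw.T.tobytes()

def merge_streams(data: bytes) -> np.ndarray:
    """Inverse transform: restore the original interleaved byte order."""
    streams = np.frombuffer(data, dtype=np.uint8).reshape(4, -1)
    return streams.T.copy().view(np.float32).reshape(-1)
```

The split buffer would then be handed to a general-purpose compressor (zstd, gzip, ...); the transform itself is lossless and cheap, essentially a byte transpose.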
> > >
> > > Due to similarities among the sign bits, high mantissa bits, and
> > > most-significant exponent bits, we may see many more repetitions in the
> > > data. For scientific data, the 2nd and 3rd bytes of each 32-bit value
> > > are probably largely noise. Thus, in the original representation we
> > > would always have a few bytes of data which could appear somewhere else
> > > in the buffer, followed by a couple of bytes of likely noise. In the
> > > new representation we have a long stream of data which could compress
> > > well, followed by a sequence of noise towards the end.
> > >
> > > This transformation improved compression ratio as can be seen in the
> > > report.
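[Editor's note: not from the original thread. A quick, self-contained way to observe the effect on synthetic smooth data, using zlib as a stand-in for the compressors discussed; actual ratios for real datasets are in the linked report.]

```python
import zlib
import numpy as np

# Synthetic "scientific" signal: a smooth curve plus small noise, so the
# sign/exponent bytes of neighbouring float32 values are highly similar
# while the low mantissa bytes are noisy.
rng = np.random.default_rng(42)
t = np.linspace(0.0, 10.0, 100_000, dtype=np.float32)
signal = (np.sin(t) + 0.001 * rng.standard_normal(t.size)).astype(np.float32)

raw = signal.tobytes()
# Stream-split layout: all 0th bytes, then all 1st bytes, etc.
split = signal.view(np.uint8).reshape(-1, 4).T.tobytes()

raw_size = len(zlib.compress(raw))
split_size = len(zlib.compress(split))
print(f"raw: {raw_size} bytes, split: {split_size} bytes")
```

On data like this, the split layout typically compresses noticeably better because the predictable high bytes form long, LZ-friendly runs instead of being interleaved with noise.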
> > >
> > > It also improved speed for ZSTD. This could be because zstd decides
> > > how to compress the data: RLE, a new Huffman tree, the Huffman tree of
> > > the previous frame, or the raw representation. Each can potentially
> > > achieve a different compression ratio and compression/decompression
> > > speed. It turned out that when the transformation is applied, zstd
> > > would attempt to compress fewer frames and copy the others verbatim.
> > > This could lead to fewer attempts to build a Huffman tree. It's hard
> > > to pinpoint the exact reason.
> > >
> > > I did not try other lossless text compressors but I expect similar
> results.
> > >
> > > For code, I can polish my patches, create a Jira task and submit the
> > > patches for review.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> > > ________________________________
> > > From: Fan Liya <liya.fa...@gmail.com>
> > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > To: dev@arrow.apache.org
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Adding a new encoding for FP data
> > >
> > > Hi Radev,
> > >
> > > Thanks for the information. It seems interesting.
> > > IMO, Arrow has much to do regarding data compression. However, there
> > > seem to be some differences between in-memory data compression and
> > > external-storage data compression.
> > >
> > > Could you please provide some reference for stream splitting?
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.ra...@tum.de>
> wrote:
> > >
> > > > Hello people,
> > > >
> > > >
> > > > there has been discussion in the Apache Parquet mailing list on
> adding a
> > > > new encoder for FP data.
> > > > The reason for this is that the compressors supported by Apache
> > > > Parquet (zstd, gzip, etc.) do not compress raw FP data well.
> > > >
> > > >
> > > > In my investigation, it turned out that a very simple technique,
> > > > named stream splitting, can improve the compression ratio, and even
> > > > the speed, for some of the compressors.
> > > >
> > > > You can read about the results here:
> > > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > >
> > > >
> > > > I went through the developer guide for Apache Arrow and wrote a
> patch to
> > > > add the new encoding and test coverage for it.
> > > >
> > > > I will polish my patch and work in parallel to extend the Apache
> Parquet
> > > > format for the new encoding.
> > > >
> > > >
> > > > If you have any concerns, please let me know.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > >
> > >
>
