Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Ryan Blue Tue, 03 Sep 2019 14:52:35 -0700

Hi Martin,

Thanks for taking a look at this! I agree that the approach here looks
promising. We've had occasional requests for lossy floating point
compression in the past, so it would be good to add this.


I did some work in this area a few years ago that is similar and I'd like
to hear what you think about that approach compared to this one. That work
was based on the same observation, that the main problem is the mantissa,
while exponents tend to compress well. What I did was take the exponent and
mantissa and encode each separately, like the component encoding in your
test. But to encode each stream, I used Parquet's RLE encoder instead of
just applying compression. This seemed to work well for exponents and sign
bits, but probably isn't worth the cost for mantissa bits. It could also be
interesting to test a separate stream for sign bits.
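
To make the component approach concrete, here is a minimal sketch of the
split I'm describing, assuming FP32 values and a plain (value, count)
run-length scheme. The names split_float32 and run_lengths are made up for
illustration, and this is not Parquet's RLE/bit-packing hybrid encoder or
the code I originally wrote.

```python
import struct
from itertools import groupby


def split_float32(values):
    """Split IEEE 754 binary32 values into sign, exponent, and mantissa streams."""
    signs, exponents, mantissas = [], [], []
    for v in values:
        # Reinterpret the 4 float bytes as an unsigned 32-bit integer.
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        signs.append(bits >> 31)               # 1 sign bit
        exponents.append((bits >> 23) & 0xFF)  # 8 exponent bits
        mantissas.append(bits & 0x7FFFFF)      # 23 mantissa bits
    return signs, exponents, mantissas


def run_lengths(stream):
    """Plain run-length encoding as (value, count) pairs (illustrative only)."""
    return [(value, len(list(group))) for value, group in groupby(stream)]


values = [1.0, 1.25, 1.5, 1.75, -2.0, -3.0]
signs, exponents, mantissas = split_float32(values)

print(run_lengths(signs))           # few long runs
print(run_lengths(exponents))       # few long runs
print(len(run_lengths(mantissas)))  # roughly one run per value
```

On this toy input the sign and exponent streams collapse into two runs
each, while every mantissa ends up in its own run, which is the behavior
described above.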

I guess what I'd like to hear your take on is whether you think adding
run-length encoding to any of the byte streams would be beneficial before
applying Zstd.
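
As a rough way to look at that question, the sketch below splits FP32
values into byte streams the way I understand BYTE_STREAM_SPLIT to work
(byte k of every value goes to stream k) and counts the runs in each
stream. The helper names byte_stream_split and count_runs are hypothetical,
little-endian layout is an assumption, and counting runs is only a proxy
for whether run-length encoding would help before Zstd.

```python
import struct
from itertools import groupby


def byte_stream_split(values):
    """Scatter byte k of each little-endian float32 into stream k."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return [raw[k::4] for k in range(4)]


def count_runs(stream):
    """Number of runs of identical consecutive bytes in a stream."""
    return sum(1 for _ in groupby(stream))


values = [1.0, 1.25, 1.5, 1.75, 2.0, 3.0]
streams = byte_stream_split(values)
runs = [count_runs(s) for s in streams]

for k, (stream, n) in enumerate(zip(streams, runs)):
    print(f"stream {k}: {stream.hex()} -> {n} run(s)")
```

Streams with few runs relative to their length are the ones where a
run-length pass might pay off. Real data would need to be measured with the
actual codecs to say anything conclusive.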

Thanks!

rb

On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I'm interested in this. I have been busy the last couple of weeks so have
> not been able to take a closer look. I will try to give some feedback this
> week.
>
> Thanks
>
> On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.ra...@tum.de> wrote:
>
> > Hello all,
> >
> >
> > thank you Julien for the interest.
> >
> >
> > Could other people in the Apache Parquet community share their opinions?
> >
> > Do you have your own data which you would like to use for testing the new
> > encoding?
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Julien Le Dem <julien.le...@wework.com.INVALID>
> > Sent: Friday, August 30, 2019 2:38:37 AM
> > To: dev@parquet.apache.org
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > I think this looks promising. At first glance it seems to combine
> > simplicity and efficiency.
> > I'd like to hear more from other members of the PMC.
> >
> > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <martin.ra...@tum.de> wrote:
> >
> > > Dear all,
> > >
> > >
> > > there was some earlier discussion on adding a new encoding for better
> > > compression of FP32 and FP64 data.
> > >
> > >
> > > The pull request which extends the format is here:
> > > https://github.com/apache/parquet-format/pull/144
> > > The change already has one approval, from Zoltan.
> > >
> > >
> > > The results from an investigation on compression ratio and speed with the
> > > new encoding vs other encodings are available here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > In many tests the new encoding achieves a better compression ratio and,
> > > in some cases, better speed. The improvements in compression speed come
> > > from the fact that the new format can potentially lead to faster parsing
> > > for some compressors like GZIP.
> > >
> > >
> > > An earlier report which examines other FP compressors (fpzip, spdp, fpc,
> > > zfp, sz) and new potential encodings is available here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT
> > > encoding focuses only on lossless compression.
> > >
> > >
> > > Can we have a vote?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix
