Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Radev, Martin Thu, 19 Sep 2019 06:11:47 -0700

Hello Ryan,

we decided that it would be beneficial to try out your proposal.

I will look into it and provide measurements on the compression ratio and speed.

Regards,

Martin

________________________________
From: Ryan Blue <[email protected]>
Sent: Saturday, September 14, 2019 2:23:20 AM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

> Using RLE for the sign, exponents and the top-most mantissa bytes can help 
> when data is repetitive and make it worse for other.

I agree. But we use RLE in similar cases because we do tend to have runs of 
values, and values that fit in a fixed number of bits. Exponents and sign bits 
would probably fit this model extremely well most of the time if you have 
similar floating point values or sorted values. It would be really interesting 
to see how well this performs in comparison to the compression tests you've 
already done. For mantissa bits, I agree it wouldn't be worth encoding first.

On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin 
<[email protected]<mailto:[email protected]>> wrote:

Hello Ryan, Wes and other parquet devs,

thanks for the response. I was away on vacation, which is why I am only 
answering now.

> whether you think adding run-length encoding to any of the byte streams would 
> be beneficial before applying Zstd.
The short answer is "only in some cases; in others it will worsen both 
compression ratio and speed".

Our initial investigation also separated the sign, exponent and mantissa into 
separate streams.

The encoding was the following assuming 32-bit IEEE754:

- stream of sign bits

- stream of exponent bits. Conveniently the exponent for a 32-bit IEEE754 
number is 8 bits.

- separate the remaining 23 mantissa bits into three streams of 8, 8 and 7 
bits. An extra zero bit is added to the stream which has only seven bits. This 
was done since zstd, zlib, etc. work at byte granularity and we want 
repetitions to occur at that granularity.
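
As a rough illustration of the 32-bit layout above, the split could be sketched 
in Python as follows. This is only a sketch: the function name is hypothetical, 
the sign bits are kept as a plain list rather than a packed bit stream, and 
placing the 7-bit piece at the low end of the mantissa is an assumption, since 
the description above only fixes the widths.

```python
import struct


def split_components_fp32(values):
    """Split 32-bit IEEE 754 floats into sign, exponent and mantissa streams.

    The 23 mantissa bits are cut into 8 + 8 + 7 bits. The 7-bit piece is
    stored in a full byte, so its top bit is always zero padding.
    (Assumption: the 7-bit piece holds the lowest mantissa bits.)
    """
    signs = []                      # one 0/1 entry per value, not bit-packed
    exps = bytearray()
    m_hi, m_mid, m_lo = bytearray(), bytearray(), bytearray()
    for v in values:
        # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        signs.append(bits >> 31)
        exps.append((bits >> 23) & 0xFF)
        mant = bits & 0x7FFFFF
        m_hi.append(mant >> 15)             # top 8 mantissa bits
        m_mid.append((mant >> 7) & 0xFF)    # next 8 mantissa bits
        m_lo.append(mant & 0x7F)            # last 7 bits, padded to a byte
    return signs, bytes(exps), bytes(m_hi), bytes(m_mid), bytes(m_lo)
```

Each returned byte stream would then be passed to the compressor separately.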

For 64-bit IEEE754 even more padding has to be added since the exponent is 11 
bits and the mantissa is 52 bits. Thus, we have to pad the exponent with 5 bits 
(to 16) and the mantissa with 4 bits (to 56) to keep repetitions at byte 
granularity. My original report shows results for when the floating-point 
values are split at component granularity. The report is here: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
The results are only slightly better in terms of compression ratio for some 
tests, but compression and decompression speed are, as expected, worse. The 
reason is that splitting a value into components is more complex: we need to 
keep a stream of bits for the signs, keep track of when a byte in that stream 
is exhausted, do bit manipulation to extract the components, etc. This is also 
why I preferred the byte-wise decomposition of the values. It is faster and the 
compression ratio is only slightly worse for some of the tests.
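
For comparison, here is a minimal sketch of the byte-wise decomposition, 
assuming little-endian values and that byte i of every value goes to stream i. 
The helper names are hypothetical and this is not the code used for the 
measurements in the report.

```python
import struct


def byte_stream_split(values, fmt="f"):
    """Scatter byte i of every value into stream i.

    fmt is a struct format character: "f" for FP32, "d" for FP64.
    Returns a list of 4 (or 8) byte strings, one per byte position.
    """
    width = struct.calcsize("<" + fmt)
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # A strided slice picks every width-th byte starting at offset i.
    return [raw[i::width] for i in range(width)]


def byte_stream_join(streams, fmt="f"):
    """Inverse of byte_stream_split: interleave the streams back into values."""
    width = len(streams)
    count = len(streams[0])
    raw = bytearray(count * width)
    for i, stream in enumerate(streams):
        raw[i::width] = stream
    return list(struct.unpack("<%d%s" % (count, fmt), bytes(raw)))
```

No bit manipulation is needed here; the transform is a strided copy, and the 
resulting streams are handed to the codec unchanged.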

Using RLE for the sign, exponent and top-most mantissa bytes can help when the 
data is repetitive but makes things worse for other data. I suppose that using 
one of the compressors yields a better compression ratio on average. RLE would 
also add another step that can make encoding and decoding slower.
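
To illustrate why RLE helps only sometimes, here is a toy run-length encoder 
that emits (value, run length) pairs. It is a simplification for illustration, 
not Parquet's RLE/bit-packing hybrid, and the function name is hypothetical.

```python
def run_length_encode(stream):
    """Collapse consecutive equal bytes into (value, run_length) pairs."""
    runs = []
    for b in stream:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([b, 1])     # start a new run
    return [(value, length) for value, length in runs]
```

A stream of identical exponent bytes collapses to a single pair, while a 
stream in which neighbouring bytes always differ produces one pair per byte and 
therefore grows.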

The design of the BYTE_STREAM_SPLIT encoding had two goals in mind:

- It would only make data more compressible and leave compression to the codec 
in use.
  This leaves the complexity to the codec and choice of speed/compression ratio 
to the user.

- It should be fast.
  There's an extra compression step so preferably there's very little latency 
before it.

@Wes, can you have a look?

More opinions are welcome.

If you have floating point data available, I would be very happy to examine 
whether this approach offers a benefit for you.

Regards,

Martin

________________________________
From: Ryan Blue <[email protected]>
Sent: Tuesday, September 3, 2019 11:51:46 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Hi Martin,

Thanks for taking a look at this! I agree that the approach here looks
promising. We've had occasional requests for lossy floating point
compression in the past, so it would be good to add this.

I did some work in this area a few years ago that is similar and I'd like
to hear what you think about that approach compared to this one. That work
was based on the same observation, that the main problem is the mantissa,
while exponents tend to compress well. What I did was take the exponent and
mantissa and encode each separately, like the component encoding in your
test. But to encode each stream, I used Parquet's RLE encoder instead of
just applying compression. This seemed to work well for exponents and sign
bits, but probably isn't worth the cost for mantissa bits. It could also be
interesting to test a separate stream for sign bits.

I guess what I'd like to hear your take on is whether you think adding
run-length encoding to any of the byte streams would be beneficial before
applying Zstd.

Thanks!

rb

On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney 
<[email protected]<mailto:[email protected]>> wrote:

> I'm interested in this. I have been busy the last couple of weeks so have
> not been able to take a closer look. I will try to give some feedback this
> week.
>
> Thanks
>
> On Tue, Sep 3, 2019, 2:17 PM Radev, Martin 
> <[email protected]<mailto:[email protected]>> wrote:
>
> > Hello all,
> >
> >
> > thank you Julien for the interest.
> >
> >
> > Could other people, part of Apache Parquet, share their opinions?
> >
> > Do you have your own data which you would like to use for testing the new
> > encoding?
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Julien Le Dem <[email protected]>
> > Sent: Friday, August 30, 2019 2:38:37 AM
> > To: [email protected]<mailto:[email protected]>
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > I think this looks promising to me. At first glance it seems to combine
> > simplicity and efficiency.
> > I'd like to hear more from other members of the PMC.
> >
> > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin 
> > <[email protected]<mailto:[email protected]>> wrote:
> >
> > > Dear all,
> > >
> > >
> > > there was some earlier discussion on adding a new encoding for better
> > > compression of FP32 and FP64 data.
> > >
> > >
> > > The pull request which extends the format is here:
> > > https://github.com/apache/parquet-format/pull/144
> > > The change already has one approval, from Zoltan.
> > >
> > >
> > > The results from an investigation on compression ratio and speed with
> > > the new encoding vs other encodings are available here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > The results show that for many tests the new encoding performs better
> > > in compression ratio and in some cases in speed. The improvements in
> > > compression speed come from the fact that the new format can
> > > potentially lead to faster parsing for some compressors like GZIP.
> > >
> > >
> > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc, zfp, sz) and new potential encodings is available here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT
> > > encoding focuses only on lossless compression.
> > >
> > >
> > > Can we have a vote?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >
>

--
Ryan Blue
Software Engineer
Netflix
