Dear Ryan Blue and other Parquet developers,

I tested Ryan's proposal for modifying the encoding.

The short answer is that it doesn't perform well in my tests. The encoding, 
code and results can be viewed below.


The current implementation handles only 32-bit IEEE 754 floats and works in 
the following way (a minimal sketch follows the list):

  1.  For each block of 128 values, the minimum and maximum exponent are 
computed.
  2.  The number of bits for the exponent RLE is computed as 
ceil(log2(max - min + 1)). The sign uses 1 bit.
  3.  The sign, the exponent and the 23 remaining mantissa bits are extracted.
  4.  One RLE encoder is used for the sign and one for the exponent. A new RLE 
encoder for the exponent is created whenever a block requires fewer or more 
bits than the current encoder uses. The 23 mantissa bits are divided into 
three byte streams. (I am not sure whether this is strictly a good idea.)
  5.  For each block of 128 values we also store 2 bytes of metadata: the 
minimum exponent and the number of bits used by the RLE.
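For concreteness, here is a minimal C++ sketch of the two-pass block encoding 
described above. The names and structure are my own illustration, not the 
actual FPByteStreamSplitComponentRLEEncoder code from the linked commit, and 
the RLE and mantissa-stream writes are elided:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

constexpr int kBlockSize = 128;

void EncodeBlock(const float* values, int n /* <= kBlockSize */) {
  // Pass 1: find the exponent range for the block.
  uint8_t min_exp = 255, max_exp = 0;
  for (int i = 0; i < n; ++i) {
    uint32_t bits;
    std::memcpy(&bits, &values[i], sizeof(bits));
    uint8_t exp = static_cast<uint8_t>((bits >> 23) & 0xFF);
    min_exp = std::min(min_exp, exp);
    max_exp = std::max(max_exp, exp);
  }
  // Bit width for the exponent RLE: ceil(log2(max - min + 1)).
  int range = max_exp - min_exp + 1;
  int bit_width = static_cast<int>(std::ceil(std::log2(range)));
  (void)bit_width;
  // Pass 2: extract sign, exponent (relative to min_exp) and the 23 mantissa
  // bits; feed sign/exponent to the RLE encoders and scatter the mantissa
  // bytes into three streams (elided here).
  for (int i = 0; i < n; ++i) {
    uint32_t bits;
    std::memcpy(&bits, &values[i], sizeof(bits));
    uint32_t sign = bits >> 31;
    uint32_t exp = ((bits >> 23) & 0xFF) - min_exp;  // fits in bit_width bits
    uint32_t mantissa = bits & 0x7FFFFF;
    (void)sign; (void)exp; (void)mantissa;  // RLE/stream writes elided
  }
  // Per block we additionally store 2 bytes: min_exp and bit_width.
}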


I did not implement a decoder and have not added unit tests to guarantee that 
the implementation is sound.

Ryan, can you please review the implementation and say whether it corresponds 
to what you had in mind?
See FPByteStreamSplitComponentRLEEncoder. Please do not focus on the code 
quality - this is only a quick hack.
https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979

In my benchmark tests I compared the following scenarios:

  *   plain encoding
  *   plain encoding + ZSTD
  *   dictionary
  *   the StreamSplit+RLE implementation above
  *   the StreamSplit+RLE implementation above + ZSTD
  *   the original BYTE_STREAM_SPLIT proposal + ZSTD

Sizes of the resulting Parquet files, in MB, for each configuration:

File              Plain   Plain+ZSTD   Dict   Split+RLE   Split+RLE+ZSTD   StreamSplit+ZSTD
msg_bt.sp           128          112    128         113               93                 84
msg_lu.sp            93           87     94          80               73                 67
msg_sp.sp           139          125    139         123               96                 88
msg_sweep3d.sp       60           19     60          47               40                 13
num_brain.sp         68           60     69          55               54                 49
num_comet.sp         52           45     50          46               42                 38
num_control.sp       77           71     77          70               69                 63
num_plasma.sp        17            2      8          16                8                  1
obs_error.sp         30           24     30          25               22                 21
obs_info.sp          10            8     10           8                7                  7
obs_spitzer.sp       95           83     95          99               82                 72
obs_temp.sp          20           18     20          18               18                 16


I have not tested the encoding on any of the other test data we have since it 
also contains FP64 columns. I did not add FP64 support to the StreamSplit+RLE 
encoding, both to save time and because I do not expect much improvement 
there.


From the results it can be seen that the proposed StreamSplit+RLE encoding 
does not improve on the original BYTE_STREAM_SPLIT proposal.

Using RLE for the sign bits and exponents obfuscates the input: the 
compression step can no longer take full advantage of repetitions in the data 
because the RLE step has already consumed them. Repetitive patterns are 
replaced by RLE bit patterns which likely do not compress very well.

Also, GZIP/ZSTD handle repetitions in data better on average. For example, 
GZIP's Deflate algorithm can encode a long run with about 3 bytes (not sure 
whether it can be fewer?) for the raw data plus about 3 bytes for the 
back-reference (8 bits + 15 bits + 2 bits).
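As a quick toy illustration (my own example, not part of the benchmark), one 
can check with zlib how cheaply Deflate encodes a long run:

// Toy zlib example: a 1 MiB run of one byte value compresses to roughly
// a kilobyte or less with Deflate.
#include <zlib.h>
#include <cstdio>
#include <vector>

int main() {
  std::vector<unsigned char> input(1 << 20, 0x42);  // 1 MiB run of 'B'
  uLongf out_len = compressBound(input.size());
  std::vector<unsigned char> output(out_len);
  if (compress(output.data(), &out_len, input.data(), input.size()) != Z_OK)
    return 1;
  std::printf("1 MiB run compressed to %lu bytes\n", (unsigned long)out_len);
  return 0;
}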

Now, the RLE step might produce better results for long runs of the same 
value. However, general-purpose compressors also handle more complex cases 
where the data contains a pattern that does not necessarily form a long run. 
Also, compressors like GZIP/ZSTD often do entropy-coding-aware parsing ( 
http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html , 
https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html).

The proposed RLE encoding + ZSTD is also slower than BYTE_STREAM_SPLIT + ZSTD. 
The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy, whereas the 
new RLE encoding has to extract the components and run through each block of 
128 values twice - once to compute the min and max, and once to encode. There 
is also the overhead of using the Arrow RLEEncoder.
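To illustrate the difference, here is a minimal sketch (my own, not the Arrow 
implementation) of the BYTE_STREAM_SPLIT encode step - a single pass that only 
scatters bytes, with no per-value arithmetic:

#include <cstddef>
#include <cstdint>

// Scatter byte b of every float into stream b. Four contiguous output
// streams of n bytes each; no bit manipulation, no second pass.
void ByteStreamSplitEncode(const float* values, size_t n, uint8_t* out) {
  const uint8_t* raw = reinterpret_cast<const uint8_t*>(values);
  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < sizeof(float); ++b) {
      out[b * n + i] = raw[i * sizeof(float) + b];
    }
  }
}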

Ryan and other folks, can you provide feedback? Does the implementation look 
reasonable to you?

Can somebody please work with me on this new encoding? There has been some 
interest and some discussion, but it has not been pushed forward, likely due 
to work around the current release.
For more discussion and results, please refer to:
Recent benchmark of Arrow implementation: 
https://github.com/martinradev/arrow-fp-compression-bench
Separate report: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing

Regards,
Martin



________________________________
From: Ryan Blue <rb...@netflix.com>
Sent: Thursday, September 19, 2019 7:54 PM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Sounds good, thanks for working on this!

On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin 
<martin.ra...@tum.de> wrote:

Hello Ryan,


we decided that it would be beneficial to try out your proposal.


I will look into it and provide measurements on the compression ratio and speed.


Regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com>
Sent: Saturday, September 14, 2019 2:23:20 AM
To: Radev, Martin
Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

> Using RLE for the sign, exponents and the top-most mantissa bytes can help 
> when data is repetitive but makes it worse for other data.

I agree. But we use RLE in similar cases because we do tend to have runs of 
values, and values that fit in a fixed number of bits. Exponents and sign bits 
would probably fit this model extremely well most of the time if you have 
similar floating point values or sorted values. It would be really interesting 
to see how well this performs in comparison to the compression tests you've 
already done. For mantissa bits, I agree it wouldn't be worth encoding first.

On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin 
<martin.ra...@tum.de> wrote:

Hello Ryan, Wes and other parquet devs,


thanks for the response. I was away on vacation, which is why I am only 
answering now.


> whether you think adding run-length encoding to any of the byte streams would 
> be beneficial before applying Zstd.
The short answer is "only for some cases, but it will make it worse in both 
compression ratio and speed for others".

Our initial investigation also separated the sign, exponent and mantissa into 
separate streams.

The encoding was the following, assuming 32-bit IEEE 754:

- stream of sign bits

- stream of exponent bits. Conveniently, the exponent of a 32-bit IEEE 754 
number is exactly 8 bits.

- separate the remaining 23 bits into three streams of 8, 8 and 7 bits. An 
extra zero bit is added to the stream which has only seven bits per value. 
This was done since zstd, zlib, etc. work at byte granularity and we want 
repetitions to occur at byte boundaries (see the sketch below).
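A minimal sketch of that bit-level decomposition for a single value (my own 
illustration; the placement of the padding bit is an assumption):

#include <cstdint>
#include <cstring>

// Split one float's mantissa into three byte streams: 8 + 8 + 7 bits,
// with the 7-bit chunk padded by one zero bit to stay byte-aligned.
void SplitMantissa(float value, uint8_t* s0, uint8_t* s1, uint8_t* s2) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  uint32_t mantissa = bits & 0x7FFFFFu;                 // low 23 bits
  *s0 = static_cast<uint8_t>(mantissa >> 15);           // top 8 mantissa bits
  *s1 = static_cast<uint8_t>((mantissa >> 7) & 0xFFu);  // next 8 bits
  *s2 = static_cast<uint8_t>((mantissa & 0x7Fu) << 1);  // last 7 bits + pad
}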

For 64-bit IEEE 754 even more padding has to be added since the exponent is 11 
bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits 
and 4 more mantissa bits to keep repetitions at byte granularity. My original 
report shows results for when the floating-point values are split at 
component granularity. The report is here: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
The results are just slightly better in terms of compression ratio for some 
tests, but compression and decompression speeds are expectedly worse. The 
reason is that splitting a value is somewhat more complex: we need to keep a 
stream of bits for the signs, keep track of when a byte in the stream is 
exhausted, do bit manipulation to extract the components, etc. This is also 
why I preferred the byte-wise decomposition of the values. It is faster, and 
the compression ratio is only slightly worse for some of the tests.


Using RLE for the sign, exponents and the top-most mantissa bytes can help 
when data is repetitive but makes it worse for other data. I suppose using one 
of the general-purpose compressors yields a better compression ratio on 
average. Also, RLE can again make encoding and decoding slower.


The design of the BYTE_STREAM_SPLIT encoding had two goals in mind:

- It should only make data more compressible and leave compression to the 
codec in use.
  This leaves the complexity to the codec and the choice of speed vs. 
compression ratio to the user.

- It should be fast.
  There is an extra compression step, so preferably there is very little 
latency before it.

@Wes, can you have a look?

More opinions are welcome.

If you have floating-point data available, I would be very happy to examine 
whether this approach offers a benefit for you.


Regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Tuesday, September 3, 2019 11:51:46 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Hi Martin,

Thanks for taking a look at this! I agree that the approach here looks
promising. We've had occasional requests for lossy floating point
compression in the past, so it would be good to add this.

I did some work in this area a few years ago that is similar and I'd like
to hear what you think about that approach compared to this one. That work
was based on the same observation, that the main problem is the mantissa,
while exponents tend to compress well. What I did was take the exponent and
mantissa and encode each separately, like the component encoding in your
test. But to encode each stream, I used Parquet's RLE encoder instead of
just applying compression. This seemed to work well for exponents and sign
bits, but probably isn't worth the cost for mantissa bits. It could also be
interesting to test a separate stream for sign bits.

I guess what I'd like to hear your take on is whether you think adding
run-length encoding to any of the byte streams would be beneficial before
applying Zstd.

Thanks!

rb

On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney 
<wesmck...@gmail.com> wrote:

> I'm interested in this. I have been busy the last couple of weeks so have
> not been able to take a closer look. I will try to give some feedback this
> week.
>
> Thanks
>
> On Tue, Sep 3, 2019, 2:17 PM Radev, Martin 
> <martin.ra...@tum.de> wrote:
>
> > Hello all,
> >
> >
> > thank you Julien for the interest.
> >
> >
> > Could other people, part of Apache Parquet, share their opinions?
> >
> > Do you have your own data which you would like to use for testing the new
> > encoding?
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Julien Le Dem <julien.le...@wework.com.INVALID>
> > Sent: Friday, August 30, 2019 2:38:37 AM
> > To: dev@parquet.apache.org
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > This looks promising to me. At first glance it seems to combine
> > simplicity and efficiency.
> > I'd like to hear more from other members of the PMC.
> >
> > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin 
> > <martin.ra...@tum.de> wrote:
> >
> > > Dear all,
> > >
> > >
> > > there was some earlier discussion on adding a new encoding for better
> > > compression of FP32 and FP64 data.
> > >
> > >
> > > The pull request which extends the format is here:
> > > https://github.com/apache/parquet-format/pull/144
> > > The change has one approval from earlier from Zoltan.
> > >
> > >
> > > The results from an investigation of compression ratio and speed with
> > > the new encoding vs. other encodings are available here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > For many tests, the new encoding performs better in compression ratio
> > > and, in some cases, in speed. The improvements in compression speed
> > > come from the fact that the new format can potentially lead to faster
> > > parsing for some compressors like GZIP.
> > >
> > >
> > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc, zfp, sz) and new potential encodings is available here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT
> > > encoding focuses only on lossless compression.
> > >
> > >
> > > Can we have a vote?
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >
>


--
Ryan Blue
Software Engineer
Netflix


