+1 from me on adding the FP encoding
On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <[email protected]> wrote:
>
> Hello all,
>
> thanks for the vote, Ryan, and to Wes for the feedback.
>
> The concern with regard to adding more complex features to the Parquet spec is valid.
> However, the proposed encoding is very simple and I already have unpolished patches for both parquet-mr and arrow.
> In its design I purposely opted for something simple to guarantee 1) good compression speed and 2) ease of implementation.
>
> On the topic of testing, I added four more test cases which were taken from here<https://sdrbench.github.io/>. I also added the size in MB of each test case and the entropy per element.
> Having the entropy reported helps show that the encoding performs better than any other option for high-entropy data and not so well for low-entropy data.
>
> I would be happy to receive some more feedback and votes.
>
> Kind regards,
> Martin
>
> ________________________________
> From: Ryan Blue <[email protected]>
> Sent: Friday, November 1, 2019 6:28 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. It looks like it is simple and performs well. We could use a good floating-point encoding.
>
> I don't think I agree that differences in features between the Java and CPP implementations should hold back new work. It would be great to have more testing and validation, as well as more thorough implementations. But I don't think we should reject contributions like this because of those concerns.
>
> On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <[email protected]> wrote:
>
> > I have to say I'm struggling with piling more things into the Parquet specification when we already have a significant implementation shortfall in other areas.
> > LZ4 is still not properly implemented, for example, and then there is the question of the V2 encodings and data page formats.
> >
> > I'm generally in favor of adding more efficient storage of floating point data, but will it actually be implemented broadly? Parquet as a format has already become an "implementation swamp" where any two implementations may not be compatible with each other, particularly in consideration of more advanced features.
> >
> > For a single organization using a single implementation, having advanced features may be useful, so I see the benefits to users that tightly control what code and what settings to use.
> >
> > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <[email protected]> wrote:
> > >
> > > Dear all,
> > >
> > > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?
> > > Please feel free to contact me directly if you need help or would like to provide more test data.
> > >
> > > Results for the encoding based on the implementation in Arrow are here: https://github.com/martinradev/arrow-fp-compression-bench
> > > Patch to Arrow is here: https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > >
> > > The new encoding combined with a compressor performs better than any of the other alternatives for data where there is little change in the upper-most bytes of fp32 and fp64 values. My early experiments also show that this encoding+zstd performs better on average than any of the specialized floating-point lossless compressors like fpc, spdp, zfp.
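[For readers new to the thread, the encoding under vote is a simple byte transposition. A minimal Python sketch of the idea — illustrative only, not the Arrow or parquet-mr patch:]

```python
import struct

def byte_stream_split_encode(values):
    """BYTE_STREAM_SPLIT for fp32: scatter the 4 bytes of each value into
    4 separate streams (all 0th bytes first, then all 1st bytes, ...).
    Bytes of equal significance end up adjacent, which a general-purpose
    codec such as zstd or gzip can then compress more effectively."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[i::4] for i in range(4))

def byte_stream_split_decode(data):
    """Inverse transform: interleave the 4 streams back into fp32 values."""
    n = len(data) // 4
    raw = bytearray(len(data))
    for i in range(4):
        raw[i::4] = data[i * n:(i + 1) * n]
    return list(struct.unpack(f"<{n}f", bytes(raw)))
```

The encode path is essentially a scattered memcpy, which is where the speed advantage over bit-level component splitting comes from.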
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Radev, Martin <[email protected]>
> > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Dear Ryan Blue and other Parquet developers,
> > >
> > > I tested Ryan's proposal for modifying the encoding.
> > > The short answer is that it doesn't perform well in my tests. The encoding, code and results can be viewed below.
> > >
> > > The current implementation only handles 32-bit IEEE754 floats in the following way:
> > >
> > > 1. For each block of 128 values, the min and max of the exponent are computed.
> > > 2. The number of bits for the exponent RLE is computed as ceil(log2(max - min + 1)). The sign uses 1 bit.
> > > 3. The sign, exponent and 23 remaining mantissa bits are extracted.
> > > 4. One RLE encoder is used for the sign and one for the exponent. A new RLE encoder for the exponent is created if the block requires fewer or more bits than the number of bits used for the current encoder. The 23 mantissa bits are divided into three streams. (Not sure whether this is strictly a good idea.)
> > > 5. Also, for every 128 values we need to store 2 bytes: the min exponent and the number of bits used by the RLE.
> > >
> > > I did not implement a decoder and have not added unit tests to guarantee that the implementation is sound.
> > >
> > > Ryan, can you please review the implementation as to whether it corresponds to what you had in mind?
> > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on the code quality - this is only a quick hack.
> > > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > >
> > > In my benchmark tests I compared the following scenarios:
> > >
> > > * dictionary
> > > * plain encoding + ZSTD
> > > * the StreamSplit+RLE implementation above
> > > * the StreamSplit+RLE implementation above + ZSTD
> > > * the original BYTE_STREAM_SPLIT proposal + ZSTD
> > >
> > > Size of parquet files, measured in MB, for each combination:
> > >
> > >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > msg_bt.sp         128      112      128      113           93               84
> > > msg_lu.sp          93       87       94       80           73               67
> > > msg_sp.sp         139      125      139      123           96               88
> > > msg_sweep3d.sp     60       19       60       47           40               13
> > > num_brain.sp       68       60       69       55           54               49
> > > num_comet.sp       52       45       50       46           42               38
> > > num_control.sp     77       71       77       70           69               63
> > > num_plasma.sp      17        2        8       16            8                1
> > > obs_error.sp       30       24       30       25           22               21
> > > obs_info.sp        10        8       10        8            7                7
> > > obs_spitzer.sp     95       83       95       99           82               72
> > > obs_temp.sp        20       18       20       18           18               16
> > >
> > > I have not tested the encoding on any of the other test data we have since they also contain FP64 columns. I did not add support for FP64 in the StreamSplit+RLE encoding to save time and also because I do not expect much improvement.
> > >
> > > From the results it can be seen that the proposed StreamSplit+RLE encoding does not improve results.
> > >
> > > Using RLE for the sign bits and exponent obfuscates the input, and the compression step cannot fully take advantage of repetitions in the data since the RLE step removed them. Repetitive patterns are replaced by the RLE bits, which likely do not compress very well.
> > >
> > > Also, GZIP/ZSTD handle repetitions in data better on average. For example, GZIP's Deflate algorithm can encode a long run length with 3 bytes (not sure whether it can be less?)
> > > for the raw data + 3 bytes for the reference (8 bits + 15 bits + 2 bits).
> > >
> > > Now, the RLE step might produce better results for long runs of the same value. However, compressors also handle more complex cases where we have a pattern in the data which doesn't necessarily have a long run length. Also, compressors like GZIP/ZSTD often do entropy-coding-aware parsing (http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html, https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html).
> > >
> > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD. The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy whereas the new RLE encoding has to extract the components and run through each block of 128 values twice - once to compute min and max, and once to encode. There's also the overhead of using the Arrow RLEEncoder.
> > >
> > > Ryan and other folks, can you provide feedback? Does the implementation look reasonable to you?
> > >
> > > Can somebody please work with me on this new encoding? There has been some interest and some discussions, but it likely hasn't been pushed forward due to work around the current release.
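[The per-block exponent handling from steps 1-2 of the experiment above can be sketched as follows; the function name is illustrative and not from the actual patch:]

```python
import math
import struct

def exponent_block_params(block):
    """For one non-empty block of up to 128 fp32 values, extract the
    8-bit biased exponent of each value and size the RLE bit width as
    ceil(log2(max - min + 1)), as in step 2 of the experiment.
    Returns (min_exponent, bit_width); a width of 0 means every
    exponent in the block is identical."""
    exps = [(struct.unpack("<I", struct.pack("<f", v))[0] >> 23) & 0xFF
            for v in block]
    lo, hi = min(exps), max(exps)
    return lo, math.ceil(math.log2(hi - lo + 1))
```

Per block, `lo` is the 1-byte minimum stored alongside the width; each exponent is then RLE-encoded as `exp - lo` in `bit_width` bits, and a new RLE encoder is started whenever the width changes.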
> > > For more discussion and results, please refer to:
> > > Recent benchmark of Arrow implementation: https://github.com/martinradev/arrow-fp-compression-bench
> > > Separate report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <[email protected]>
> > > Sent: Thursday, September 19, 2019 7:54 PM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Sounds good, thanks for working on this!
> > >
> > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <[email protected]> wrote:
> > >
> > > Hello Ryan,
> > >
> > > we decided that it would be beneficial to try out your proposal.
> > > I will look into it and provide measurements on the compression ratio and speed.
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <[email protected]>
> > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > > Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for others.
> > >
> > > I agree. But we use RLE in similar cases because we do tend to have runs of values, and values that fit in a fixed number of bits.
> > > Exponents and sign bits would probably fit this model extremely well most of the time if you have similar floating point values or sorted values. It would be really interesting to see how well this performs in comparison to the compression tests you've already done. For mantissa bits, I agree it wouldn't be worth encoding first.
> > >
> > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <[email protected]> wrote:
> > >
> > > Hello Ryan, Wes and other parquet devs,
> > >
> > > thanks for the response. I was away on vacation and that's why I am answering just now.
> > >
> > > > whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
> > > The short answer is "only for some cases, but it will make it worse in both compression ratio and speed for others".
> > >
> > > Our initial investigation also separated the sign, exponent and mantissa into separate streams.
> > > The encoding was the following, assuming 32-bit IEEE754:
> > >
> > > - stream of sign bits
> > > - stream of exponent bits. Conveniently, the exponent for a 32-bit IEEE754 number is 8 bits.
> > > - the remaining 23 bits separated into three streams of 8, 8 and 7 bits. An extra zero bit is added to the stream which has only seven bits. This was done since zstd, zlib, etc. work at a byte granularity and we would want repetitions to happen at byte boundaries.
> > >
> > > For 64-bit IEEE754 even more padding has to be added since the exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits and 4 more mantissa bits to keep repetitions at a byte granularity. My original report shows results for when the floating-point values are split at a component granularity.
> > > Report is here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > Results are just slightly better in terms of compression ratio for some tests, but compression and decompression speed is expectedly worse. The reason is that splitting a value is somewhat more complex. We need to keep a stream of bits for the signs, keep track of when a byte in the stream is exhausted, do bit manipulation to extract components, etc. This is also the reason why I preferred to go with the byte-wise decomposition of the values. It's faster and the compression ratio is just slightly worse for some of the tests.
> > >
> > > Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for others. I suppose using one of the compressors yields a better compression ratio on average. Also, this can again make encoding and decoding slower.
> > >
> > > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> > >
> > > - It would only make data more compressible and leave compression to the codec in use. This leaves the complexity to the codec and the choice of speed/compression ratio to the user.
> > > - It should be fast. There's an extra compression step, so preferably there's very little latency before it.
> > >
> > > @Wes, can you have a look?
> > > More opinions are welcome.
> > > If you have floating point data available, I would be very happy to examine whether this approach offers a benefit for you.
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <[email protected]>
> > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Hi Martin,
> > >
> > > Thanks for taking a look at this!
> > > I agree that the approach here looks promising. We've had occasional requests for lossy floating point compression in the past, so it would be good to add this.
> > >
> > > I did some work in this area a few years ago that is similar and I'd like to hear what you think about that approach compared to this one. That work was based on the same observation, that the main problem is the mantissa, while exponents tend to compress well. What I did was take the exponent and mantissa and encode each separately, like the component encoding in your test. But to encode each stream, I used Parquet's RLE encoder instead of just applying compression. This seemed to work well for exponents and sign bits, but probably isn't worth the cost for mantissa bits. It could also be interesting to test a separate stream for sign bits.
> > >
> > > I guess what I'd like to hear your take on is whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
> > >
> > > Thanks!
> > >
> > > rb
> > >
> > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <[email protected]> wrote:
> > >
> > > > I'm interested in this. I have been busy the last couple of weeks so have not been able to take a closer look. I will try to give some feedback this week.
> > > >
> > > > Thanks
> > > >
> > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <[email protected]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > thank you Julien for the interest.
> > > > >
> > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > Do you have your own data which you would like to use for testing the new encoding?
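[The component-wise decomposition discussed earlier in the thread — a bit-stream of signs, a byte stream of 8-bit exponents, and the 23 mantissa bits padded with a zero bit into three byte-aligned sub-streams of 8, 8 and 7(+1) bits — can be sketched as below. The function name is illustrative, not from the actual patches:]

```python
import struct

def split_components_fp32(values):
    """Component-wise split sketch: one packed bit-stream of signs, one
    byte stream of 8-bit biased exponents, and the 23 mantissa bits
    emitted as three byte streams (top 8, middle 8, low 7 bits plus one
    zero pad bit), keeping repetitions at byte granularity for a
    byte-oriented codec such as zstd or zlib."""
    signs, exps = bytearray(), bytearray()
    m_hi, m_mid, m_lo = bytearray(), bytearray(), bytearray()
    sign_acc, sign_n = 0, 0
    for v in values:
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        sign_acc = (sign_acc << 1) | (bits >> 31)  # pack 8 signs per byte
        sign_n += 1
        if sign_n == 8:
            signs.append(sign_acc)
            sign_acc, sign_n = 0, 0
        exps.append((bits >> 23) & 0xFF)   # 8-bit biased exponent
        m = bits & 0x7FFFFF                # 23 mantissa bits
        m_hi.append(m >> 15)               # top 8 bits
        m_mid.append((m >> 7) & 0xFF)      # middle 8 bits
        m_lo.append((m & 0x7F) << 1)       # low 7 bits + 1 zero pad bit
    if sign_n:                             # flush a partial sign byte
        signs.append(sign_acc << (8 - sign_n))
    return bytes(signs), bytes(exps), bytes(m_hi), bytes(m_mid), bytes(m_lo)
```

As the thread notes, this bit-level bookkeeping (sign packing, component extraction) is the main reason the component-wise variant encodes and decodes more slowly than the plain byte-wise split.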
> > > > >
> > > > > Regards,
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Julien Le Dem <[email protected]>
> > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > To: [email protected]
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > This looks promising to me. At first glance it seems to combine simplicity and efficiency.
> > > > > I'd like to hear more from other members of the PMC.
> > > > >
> > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <[email protected]> wrote:
> > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > there was some earlier discussion on adding a new encoding for better compression of FP32 and FP64 data.
> > > > > >
> > > > > > The pull request which extends the format is here: https://github.com/apache/parquet-format/pull/144
> > > > > > The change already has one approval, from Zoltan.
> > > > > >
> > > > > > The results from an investigation of compression ratio and speed with the new encoding vs other encodings are available here: https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > It is visible that for many tests the new encoding performs better in compression ratio and, in some cases, in speed. The improvements in compression speed come from the fact that the new format can potentially lead to faster parsing for some compressors like GZIP.
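[The effect described above — that grouping bytes of equal significance gives a byte-oriented compressor easier work — can be illustrated with Python's stdlib zlib (DEFLATE) standing in for the GZIP/ZSTD codecs measured in the thread. The data is synthetic, so the numbers only gesture at the effect:]

```python
import struct
import zlib

# Toy data: smoothly varying fp32 values, so the upper bytes
# (sign/exponent/high mantissa) barely change from value to value.
values = [1.0 + i * 1e-7 for i in range(10_000)]
plain = struct.pack(f"<{len(values)}f", *values)

# BYTE_STREAM_SPLIT layout: stream of 0th bytes, then 1st, 2nd, 3rd.
split = b"".join(plain[i::4] for i in range(4))

c_plain = len(zlib.compress(plain))
c_split = len(zlib.compress(split))
print(f"plain+deflate: {c_plain} bytes, split+deflate: {c_split} bytes")
```

On data like this, the split layout hands DEFLATE two nearly constant streams, so it compresses noticeably better; on low-entropy or already repetitive data the picture can flip, which matches the mixed per-dataset results in the thread's benchmarks.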
> > > > > >
> > > > > > An earlier report which examines other FP compressors (fpzip, spdp, fpc, zfp, sz) and new potential encodings is available here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT encoding focuses only on lossless compression.
> > > > > >
> > > > > > Can we have a vote?
> > > > > >
> > > > > > Regards,
> > > > > > Martin
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
