Hello all,

thanks to Ryan for the vote and to Wes for the feedback.


The concern regarding adding more complex features to the Parquet spec is 
valid.

However, the proposed encoding is very simple and I already have unpolished 
patches for both parquet-mr and arrow.

In its design I purposely opted for something simple to guarantee 1) good 
compression speed and 2) ease of implementation.
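For anyone curious what the transform looks like, here is a minimal sketch in Python/NumPy. This is illustrative only; the actual parquet-mr and Arrow patches differ, and the function names are mine:

```python
import numpy as np

def byte_stream_split_encode(values: np.ndarray) -> bytes:
    # View the values as raw bytes, one row per value, then transpose:
    # stream k holds byte k of every value, streams stored back to back.
    width = values.dtype.itemsize
    rows = values.view(np.uint8).reshape(-1, width)
    return rows.T.tobytes()

def byte_stream_split_decode(data: bytes, dtype) -> np.ndarray:
    # Inverse: gather byte k of value i from stream k and reassemble.
    width = np.dtype(dtype).itemsize
    streams = np.frombuffer(data, dtype=np.uint8).reshape(width, -1)
    return np.ascontiguousarray(streams.T).view(dtype).ravel()
```

A compressor such as zstd is then applied to the concatenated streams; the transform itself is lossless and amounts to a scattered memcpy.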


On the topic of testing, I added four more test cases taken from 
here<https://sdrbench.github.io/>. I also added the size in MB and the 
entropy per element for each test case.

Having the entropy reported helps show that the encoding performs better than 
any other option for high-entropy data and not so well for low-entropy data.
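As a rough sketch, per-element entropy can be computed with a histogram over the raw element bytes. This is a hypothetical helper, not the script used for the linked results:

```python
import math
import struct
from collections import Counter

def entropy_per_element(values, fmt='<f'):
    # Shannon entropy in bits, treating each encoded element
    # (e.g. each 4-byte fp32 bit pattern) as one symbol.
    symbols = [struct.pack(fmt, v) for v in values]
    n = len(symbols)
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A constant column gives 0 bits; a column of all-distinct values gives log2(n) bits, the worst case for dictionary-style approaches.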


I would be happy to receive some more feedback and votes.


Kind regards,

Martin

________________________________
From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Friday, November 1, 2019 6:28 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. Looks
like it is simple and performs well. We could use a good floating point
encoding.

I don't think I agree that differences in features between the Java and CPP
implementations should hold back new work. It would be great to have more
testing and validation, as well as more thorough implementations. But I
don't think we should reject contributions like this because of those
concerns.

On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I have to say I'm struggling with piling more things into the Parquet
> specification when we already have a significant implementation
> shortfall in other areas. LZ4 is still not properly implemented for
> example, and then there is the question of the V2 encodings and data
> page formats.
>
> I'm generally in favor of adding more efficient storage of floating
> point data, but will it actually be implemented broadly? Parquet as a
> format already has become an "implementation swamp" where any two
> implementations may not be compatible with each other, particularly in
> consideration of more advanced features.
>
> For a single organization using a single implementation, having
> advanced features may be useful, so I see the benefits to users that
> tightly control what code and what settings to use.
>
> On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <martin.ra...@tum.de> wrote:
> >
> > Dear all,
> >
> >
> > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?
> >
> > Please feel free to contact me directly if you need help or would like
> to provide more test data.
> >
> >
> > Results for the encoding based on the implementation in Arrow are here:
> https://github.com/martinradev/arrow-fp-compression-bench
> > Patch to Arrow is here:
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> >
> >
> > The new encoding combined with a compressor performs better than any of
> the other alternatives for data where there is little change in the
> upper-most bytes of fp32 and fp64 values. My early experiments also show
> that this encoding+zstd performs better on average than any of the
> specialized floating-point lossless compressors like fpc, spdp, zfp.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Radev, Martin <martin.ra...@tum.de>
> > Sent: Thursday, October 10, 2019 2:34:15 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Dear Ryan Blue and other Parquet developers,
> >
> > I tested Ryan's proposal for modifying the encoding.
> >
> > The short answer is that it doesn't perform well in my tests. The
> encoding, code and results can be viewed below.
> >
> >
> > The current implementation only handles 32-bit IEEE754 floats in the
> following way:
> >
> >   1.  For each block of 128 values, the min and max exponent are computed
> >   2.  The number of bits for the exponent RLE is computed as
> ceil(log2((max - min + 1))). The sign bit uses 1 bit.
> >   3.  The sign, exponent and 23 remaining mantissa bits are extracted.
> >   4.  One RLE encoder is used for the sign and one for the exponent.
> > A new RLE encoder for the exponent is created if the block requires more
> or fewer bits than the number of bits used by the current encoder.
> > The 23 mantissa bits are divided into three streams. (Not sure whether
> this is strictly a good idea).
> >   5.  Also, for each block of 128 values we need to store 2 bytes: the min
> exponent and the number of bits used by the RLE.
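Steps 1-2 above, computing the per-block exponent range and the RLE bit width, might look like the following. This is a sketch assuming fp32; the function name is mine, not from the patch:

```python
import math
import struct

def exponent_block_params(block):
    # For one block of fp32 values: extract the 8-bit biased exponent of
    # each value, then derive the min exponent and the RLE bit width
    # ceil(log2(max - min + 1)) described in steps 1-2.
    exps = [(struct.unpack('<I', struct.pack('<f', v))[0] >> 23) & 0xFF
            for v in block]
    lo, hi = min(exps), max(exps)
    bit_width = math.ceil(math.log2(hi - lo + 1))
    return lo, bit_width
```

Note that a block whose values all share one exponent yields a bit width of 0, so the exponent stream for that block costs only the 2 bytes of block metadata.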
> >
> >
> > I did not implement a decoder and have not added unit tests to guarantee
> that the implementation is sound.
> >
> > Ryan, can you please review the implementation as to whether it
> corresponds to what you had in mind?
> > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on the
> code quality - this is only a quick hack.
> >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> >
> > In my benchmark tests I compared the following scenarios:
> >
> >   *   dictionary
> >   *   plain encoding + ZSTD
> >   *   the StreamSplit+RLE implementation above
> >   *   the StreamSplit+RLE implementation above + ZSTD
> >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> >
> > Size of parquet files, measured in MB, for each combination:
> >
> >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > msg_bt.sp          128         112   128        113              93                84
> > msg_lu.sp           93          87    94         80              73                67
> > msg_sp.sp          139         125   139        123              96                88
> > msg_sweep3d.sp      60          19    60         47              40                13
> > num_brain.sp        68          60    69         55              54                49
> > num_comet.sp        52          45    50         46              42                38
> > num_control.sp      77          71    77         70              69                63
> > num_plasma.sp       17           2     8         16               8                 1
> > obs_error.sp        30          24    30         25              22                21
> > obs_info.sp         10           8    10          8               7                 7
> > obs_spitzer.sp      95          83    95         99              82                72
> > obs_temp.sp         20          18    20         18              18                16
> >
> >
> > I have not tested the encoding on any of the other test data we have
> since they also contain FP64 columns. I did not add support for FP64 in the
> StreamSplit+RLE encoding to save on time and also because I do not expect
> much improvement.
> >
> >
> > From the results it can be seen that the proposed StreamSplit+RLE
> encoding does not improve results.
> >
> > Using RLE for the sign bits and exponent obfuscates the input, and the
> compression step cannot fully take advantage of repetitions in the data
> since they were removed by the RLE step. Repetitive patterns are replaced
> by the RLE bits, which likely do not compress very well.
> >
> > Also, GZIP/ZSTD handle repetitions in data better on average. For
> example, GZIP's Deflate algorithm can encode a long run length with 3
> bytes (not sure whether it can be less?) for the raw data + 3 bytes for the
> reference ( 8 bits + 15 bits + 2 bits ).
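The claim that general-purpose codecs encode long runs cheaply is easy to check empirically (illustrative sketch only):

```python
import zlib

# A 1 MB run of identical bytes collapses to a tiny Deflate stream:
# each back-reference covers up to 258 bytes of the run, and the
# repeated match tokens themselves Huffman-code down to a few bits.
run = b'\x00' * (1 << 20)
compressed = zlib.compress(run, 9)
ratio = len(run) / len(compressed)
```

The compressed size comes out at around a kilobyte, i.e. a ratio on the order of 1000:1 without any explicit RLE pre-pass.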
> >
> > Now, the RLE step might produce better results for long runs of the same
> value. However, compressors also handle more complex cases where we have a
> pattern in the data which doesn't necessarily have a long run length. Also,
> compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ).
> >
> > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy whereas
> the new RLE encoding has to extract the components and run through each
> block of 128 values twice - once to compute min and max, and once to
> encode. There's also the overhead of using the Arrow RLEEncoder.
> >
> > Ryan and other folks, can you provide feedback? Does the implementation
> look reasonable to you?
> >
> > Can somebody please work with me on this new encoding? There has been
> some interest and some discussions but it hasn't been pushed forward,
> likely due to work on the current release.
> > For a bit more discussions and results, please refer to:
> > Recent benchmark of Arrow implementation:
> https://github.com/martinradev/arrow-fp-compression-bench
> > Separate report:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> >
> > Regards,
> > Martin
> >
> >
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com>
> > Sent: Thursday, September 19, 2019 7:54 PM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Sounds good, thanks for working on this!
> >
> > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.ra...@tum.de
> <mailto:martin.ra...@tum.de>> wrote:
> >
> > Hello Ryan,
> >
> >
> > we decided that it would be beneficial to try out your proposal.
> >
> >
> > I will look into it and provide measurements on the compression ratio
> and speed.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>>
> > Sent: Saturday, September 14, 2019 2:23:20 AM
> > To: Radev, Martin
> > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > > Using RLE for the sign, exponent and the top-most mantissa bytes can
> help when data is repetitive and make it worse for others.
> >
> > I agree. But we use RLE in similar cases because we do tend to have runs
> of values, and values that fit in a fixed number of bits. Exponents and
> sign bits would probably fit this model extremely well most of the time if
> you have similar floating point values or sorted values. It would be really
> interesting to see how well this performs in comparison to the compression
> tests you've already done. For mantissa bits, I agree it wouldn't be worth
> encoding first.
> >
> > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.ra...@tum.de
> <mailto:martin.ra...@tum.de>> wrote:
> >
> > Hello Ryan, Wes and other parquet devs,
> >
> >
> > thanks for the response. I was away on vacation and that's why I am
> answering just now.
> >
> >
> > > whether you think adding run-length encoding to any of the byte
> streams would be beneficial before applying Zstd.
> > The short answer is "only for some cases but it will make it worse in
> both compression ratio and speed for others".
> >
> > Our initial investigation also separated the sign, exponent and mantissa
> into separate streams.
> >
> > The encoding was the following assuming 32-bit IEEE754:
> >
> > - stream of sign bits
> >
> > - stream of exponent bits. Conveniently the exponent for a 32-bit
> IEEE754 number is 8 bits.
> >
> > - separate the remaining 23 bits into three streams of 8, 8, and 7 bits. An
> extra zero bit is added to the stream which has only seven bits. This was
> done since zstd, zlib, etc work at a byte granularity and we would want
> repetitions to happen at byte boundaries.
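The three-way component decomposition described above can be sketched as follows. This assumes fp32 and returns per-value component lists rather than the packed bit streams themselves; the helper name is hypothetical:

```python
import struct

def split_fp32(values):
    # Decompose each binary32 value into its sign bit, 8-bit biased
    # exponent, and 23-bit mantissa (shifted left by one zero bit to
    # 24 bits so the mantissa streams stay byte-aligned).
    signs, exponents, mantissas = [], [], []
    for v in values:
        bits = struct.unpack('<I', struct.pack('<f', v))[0]
        signs.append(bits >> 31)
        exponents.append((bits >> 23) & 0xFF)
        mantissas.append((bits & 0x7FFFFF) << 1)  # pad to 24 bits
    return signs, exponents, mantissas
```

The per-bit bookkeeping this implies (packing sign bits, tracking partially filled bytes) is exactly the overhead that made the byte-wise decomposition preferable in practice.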
> >
> > For 64-bit IEEE754 even more padding has to be added since the exponent
> is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> exponent bits and 4 more mantissa bits to keep repetitions at a byte
> granularity. My original report shows results for when the floating-point
> values are split at a component granularity. Report is here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > Results are just slightly better in terms of compression ratio for some
> tests but compression and decompression speed is expectedly worse. The
> reason is that splitting a value is somewhat more complex. We need to keep
> a stream of bits for the signs, keep track of when a byte in the stream is
> exhausted, do bit manipulation to extract components, etc. This is also the
> reason why I preferred to go with the byte-wise decomposition of the
> values. It's faster and the compression ratio is just slightly worse for
> some of the tests.
> >
> >
> > Using RLE for the sign, exponent and the top-most mantissa bytes can
> help when data is repetitive and make it worse for others. I suppose using
> one of the compressors yields a better compression ratio on average. Also,
> this can again make encoding and decoding slower.
> >
> >
> > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> >
> > - It would only make data more compressible and leave compression to the
> codec in use.
> >   This leaves the complexity to the codec and choice of
> speed/compression ratio to the user.
> >
> > - It should be fast.
> >   There's an extra compression step so preferably there's very little
> latency before it.
> >
> > @Wes, can you have a look?
> >
> > More opinions are welcome.
> >
> > If you have floating point data available, I would be very happy to
> examine whether this approach offers benefit for you.
> >
> >
> > Regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > Hi Martin,
> >
> > Thanks for taking a look at this! I agree that the approach here looks
> > promising. We've had occasional requests for lossy floating point
> > compression in the past, so it would be good to add this.
> >
> > I did some work in this area a few years ago that is similar and I'd like
> > to hear what you think about that approach compared to this one. That
> work
> > was based on the same observation, that the main problem is the mantissa,
> > while exponents tend to compress well. What I did was take the exponent
> and
> > mantissa and encode each separately, like the component encoding in your
> > test. But to encode each stream, I used Parquet's RLE encoder instead of
> > just applying compression. This seemed to work well for exponents and
> sign
> > bits, but probably isn't worth the cost for mantissa bits. It could also
> be
> > interesting to test a separate stream for sign bits.
> >
> > I guess what I'd like to hear your take on is whether you think adding
> > run-length encoding to any of the byte streams would be beneficial before
> > applying Zstd.
> >
> > Thanks!
> >
> > rb
> >
> > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmck...@gmail.com
> <mailto:wesmck...@gmail.com>> wrote:
> >
> > > I'm interested in this. I have been busy the last couple of weeks so
> have
> > > not been able to take a closer look. I will try to give some feedback
> this
> > > week.
> > >
> > > Thanks
> > >
> > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.ra...@tum.de
> <mailto:martin.ra...@tum.de>> wrote:
> > >
> > > > Hello all,
> > > >
> > > >
> > > > thank you Julien for the interest.
> > > >
> > > >
> > > > Could other people, part of Apache Parquet, share their opinions?
> > > >
> > > > Do you have your own data which you would like to use for testing
> the new
> > > > encoding?
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Julien Le Dem <julien.le...@wework.com.INVALID>
> > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > To: dev@parquet.apache.org<mailto:dev@parquet.apache.org>
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > I think this looks promising to me. At first glance it seems to
> combine
> > > > simplicity and efficiency.
> > > > I'd like to hear more from other members of the PMC.
> > > >
> > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <martin.ra...@tum.de
> <mailto:martin.ra...@tum.de>>
> > > wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > there was some earlier discussion on adding a new encoding for
> better
> > > > > compression of FP32 and FP64 data.
> > > > >
> > > > >
> > > > > The pull request which extends the format is here:
> > > > > https://github.com/apache/parquet-format/pull/144
> > > > > The change has one approval from earlier from Zoltan.
> > > > >
> > > > >
> > > > > The results from an investigation on compression ratio and speed
> with
> > > the
> > > > > new encoding vs other encodings is available here:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > It is visible that for many tests the new encoding performs better
> in
> > > > > compression ratio and in some cases in speed. The improvements in
> > > > > compression speed come from the fact that the new format can
> > > potentially
> > > > lead to faster parsing for some compressors like GZIP.
> > > > >
> > > > >
> > > > > An earlier report which examines other FP compressors (fpzip, spdp,
> > > fpc,
> > > > > zfp, sz) and new potential encodings is available here:
> > > > >
> > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > > > encoding only has the focus of lossless compression.
> > > > >
> > > > >
> > > > > Can we have a vote?
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>


--
Ryan Blue
Software Engineer
Netflix
