+1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.

On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <wesmck...@gmail.com> wrote:

> +1 from me on adding the FP encoding
>
> On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <martin.ra...@tum.de> wrote:
> >
> > Hello all,
> >
> >
> > thanks for the vote Ryan and to Wes for the feedback.
> >
> >
> > The concern with regard to adding more complex features in the Parquet
> spec is valid.
> >
> > However, the proposed encoding is very simple and I already have
> unpolished patches for both parquet-mr and arrow.
> >
> > In its design I purposely opted for something simple to guarantee 1)
> good compression speed and 2) ease of implementation.
> >
> >
> > On the topic of testing, I added four more test cases which were taken
> from here<https://sdrbench.github.io/>. I also added the size in MB of
> all test cases and the entropy per element.
> >
> > Having the entropy reported helps show that the encoding performs better
> than any other option for high-entropy data and not so well for low-entropy
> data.
> >
> >
> > I would be happy to receive some more feedback and votes.
> >
> >
> > Kind regards,
> >
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > Sent: Friday, November 1, 2019 6:28 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> Looks
> > like it is simple and performs well. We could use a good floating point
> > encoding.
> >
> > I don't think I agree that differences in features between the Java and
> CPP
> > implementations should hold back new work. It would be great to have more
> > testing and validation, as well as more thorough implementations. But I
> > don't think we should reject contributions like this because of those
> > concerns.
> >
> > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I have to say I'm struggling with piling more things into the Parquet
> > > specification when we already have a significant implementation
> > > shortfall in other areas. LZ4 is still not properly implemented for
> > > example, and then there is the question of the V2 encodings and data
> > > page formats.
> > >
> > > I'm generally in favor of adding more efficient storage of floating
> > > point data, but will it actually be implemented broadly? Parquet as a
> > > format already has become an "implementation swamp" where any two
> > > implementations may not be compatible with each other, particularly in
> > > consideration of more advanced features.
> > >
> > > For a single organization using a single implementation, having
> > > advanced features may be useful, so I see the benefits to users that
> > > tightly control what code and what settings to use.
> > >
> > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <martin.ra...@tum.de>
> wrote:
> > > >
> > > > Dear all,
> > > >
> > > >
> > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> encoding?
> > > >
> > > > Please feel free to contact me directly if you need help or would
> like
> > > to provide more test data.
> > > >
> > > >
> > > > Results for the encoding based on the implementation in Arrow are
> here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > Patch to Arrow is here:
> > >
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > >
> > > >
> > > > The new encoding combined with a compressor performs better than any
> of
> > > the other alternatives for data where there is little change in the
> > > upper-most bytes of fp32 and fp64 values. My early experiments also
> show
> > > that this encoding+zstd performs better on average than any of the
> > > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > > >
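[Editor's note: the claim that the encoding mainly rearranges bytes so a general-purpose codec compresses them better can be illustrated with a small, self-contained sketch. This is hypothetical Python using stdlib zlib in place of zstd, not code from the actual patches.]

```python
import struct
import zlib

def split_bytes(raw, width=4):
    # Scatter byte k of every fp32 value into stream k, then
    # concatenate the four streams (the byte-wise decomposition).
    return b"".join(raw[k::width] for k in range(width))

# Low-entropy sample: slowly varying fp32 values, so the sign and
# exponent bytes of consecutive values rarely change.
values = [100.0 + 0.001 * i for i in range(10000)]
raw = struct.pack(f"<{len(values)}f", *values)

plain_size = len(zlib.compress(raw, 6))
split_size = len(zlib.compress(split_bytes(raw), 6))
# For data like this, the split layout usually compresses smaller.
```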
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Radev, Martin <martin.ra...@tum.de>
> > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Dear Ryan Blue and other Parquet developers,
> > > >
> > > > I tested Ryan's proposal for modifying the encoding.
> > > >
> > > > The short answer is that it doesn't perform well in my tests. The
> > > encoding, code and results can be viewed below.
> > > >
> > > >
> > > > The current implementation only handles 32-bit IEEE754 floats in the
> > > following way:
> > > >
> > > >   1.  For each block of 128 values, the min and max of the exponent
> > > are computed.
> > > >   2.  The number of bits for the exponent RLE is computed as
> > > ceil(log2(max - min + 1)). The sign bit uses 1 bit.
> > > >   3.  The sign, exponent, and the remaining 23 mantissa bits are
> > > extracted.
> > > >   4.  One RLE encoder is used for the sign and one for the exponent.
> > > > A new RLE encoder for the exponent is created if the block requires
> > > fewer or more bits than the number of bits used by the current encoder.
> > > The 23 mantissa bits are divided into three streams. (Not sure whether
> > > this is strictly a good idea.)
> > > >   5.  Also, for each 128 values we need to store 2 bytes: the min
> > > value and the number of bits used by the RLE.
> > > >
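[Editor's note: the per-block bookkeeping in steps 1, 2, and 5 can be sketched as follows. This is an illustrative Python fragment with a hypothetical function name, not the actual parquet-mr/Arrow patch.]

```python
import math

def exponent_block_params(exponents, block_size=128):
    # For each block of up to 128 exponent values, compute what the
    # scheme above stores per block: the minimum exponent and the RLE
    # bit width ceil(log2(max - min + 1)). A block of identical
    # exponents needs 0 bits.
    params = []
    for i in range(0, len(exponents), block_size):
        block = exponents[i:i + block_size]
        lo, hi = min(block), max(block)
        params.append((lo, math.ceil(math.log2(hi - lo + 1))))
    return params
```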
> > > >
> > > > I did not implement a decoder and have not added unit tests to
> guarantee
> > > that the implementation is sound.
> > > >
> > > > Ryan, can you please review the implementation as to whether it
> > > corresponds to what you had in mind?
> > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on
> the
> > > code quality - this is only a quick hack.
> > > >
> > >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > >
> > > > In my benchmark tests I compared the following scenarios:
> > > >
> > > >   *   dictionary
> > > >   *   plain encoding + ZSTD
> > > >   *   the StreamSplit+RLE implementation above
> > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > >
> > > > Size of parquet files, measured in MB, for each combination:
> > > >
> > > >                   Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > > msg_bt.sp          128      112       128     113            93              84
> > > > msg_lu.sp           93       87        94      80            73              67
> > > > msg_sp.sp          139      125       139     123            96              88
> > > > msg_sweep3d.sp      60       19        60      47            40              13
> > > > num_brain.sp        68       60        69      55            54              49
> > > > num_comet.sp        52       45        50      46            42              38
> > > > num_control.sp      77       71        77      70            69              63
> > > > num_plasma.sp       17        2         8      16             8               1
> > > > obs_error.sp        30       24        30      25            22              21
> > > > obs_info.sp         10        8        10       8             7               7
> > > > obs_spitzer.sp      95       83        95      99            82              72
> > > > obs_temp.sp         20       18        20      18            18              16
> > > >
> > > >
> > > > I have not tested the encoding on any of the other test data we have
> > > since they also contain FP64 columns. I did not add support for FP64 in
> > > the StreamSplit+RLE encoding to save time and also because I do not
> > > expect much improvement.
> > > >
> > > >
> > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > encoding does not improve results.
> > > >
> > > > Using RLE for the sign bits and exponent obfuscates the input and the
> > > compression step cannot fully take advantage of repetitions in the data
> > > since they were removed by the RLE step. Repetitive patterns are
> > > replaced by RLE output bits which likely do not compress very well.
> > > >
> > > > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > > example, GZIP's Deflate algorithm can encode a long run length with 3
> > > bytes (not sure whether it can be fewer?) for the raw data + 3 bytes for
> > > the reference (8 bits + 15 bits + 2 bits).
> > > >
> > > > Now, the RLE step might produce better results for long runs of the
> same
> > > value. However, compressors also handle more complex cases where we have
> > > a pattern in the data which doesn't necessarily have a long run length.
> > > Also,
> > > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > ).
> > > >
> > > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy
> whereas
> > > the new RLE encoding has to extract the components and run through each
> > > block of 128 values twice - once to compute min and max, and once to
> > > encode. There's also the overhead of using the Arrow RLEEncoder.
> > > >
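[Editor's note: the "scattered memcpy" nature of BYTE_STREAM_SPLIT mentioned above can be sketched like this. A minimal Python illustration, not the Arrow implementation.]

```python
import struct

def byte_stream_split(values):
    # Pack fp32 values little-endian, then gather byte k of every
    # value into stream k - four streams in total.
    raw = struct.pack(f"<{len(values)}f", *values)
    return [raw[k::4] for k in range(4)]

def byte_stream_unsplit(streams):
    # Interleave the four streams back into the original byte order.
    n = len(streams[0])
    raw = bytearray(4 * n)
    for k, s in enumerate(streams):
        raw[k::4] = s
    return list(struct.unpack(f"<{n}f", bytes(raw)))
```

The encode path is a single pass of strided byte copies, which is why it is cheaper than schemes that must extract bit-level components.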
> > > > Ryan and other folks, can you provide feedback? Does the
> implementation
> > > look reasonable to you?
> > > >
> > > > Can somebody please work with me on this new encoding? There has been
> > > some interest and some discussions but it hasn't been pushed forward,
> > > likely due to work around the current release.
> > > > For a bit more discussions and results, please refer to:
> > > > Recent benchmark of Arrow implementation:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > Separate report:
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com>
> > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Sounds good, thanks for working on this!
> > > >
> > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.ra...@tum.de
> > > <mailto:martin.ra...@tum.de>> wrote:
> > > >
> > > > Hello Ryan,
> > > >
> > > >
> > > > we decided that it would be beneficial to try out your proposal.
> > > >
> > > >
> > > > I will look into it and provide measurements on the compression ratio
> > > and speed.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>>
> > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > > Using RLE for the sign, exponents and the top-most mantissa bytes
> can
> > > help when data is repetitive and make it worse for others.
> > > >
> > > > I agree. But we use RLE in similar cases because we do tend to have
> runs
> > > of values, and values that fit in a fixed number of bits. Exponents and
> > > sign bits would probably fit this model extremely well most of the
> time if
> > > you have similar floating point values or sorted values. It would be
> really
> > > interesting to see how well this performs in comparison to the
> compression
> > > tests you've already done. For mantissa bits, I agree it wouldn't be
> worth
> > > encoding first.
> > > >
> > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.ra...@tum.de
> > > <mailto:martin.ra...@tum.de>> wrote:
> > > >
> > > > Hello Ryan, Wes and other parquet devs,
> > > >
> > > >
> > > > thanks for the response. I was away on vacation and that's why I am
> > > answering just now.
> > > >
> > > >
> > > > > whether you think adding run-length encoding to any of the byte
> > > streams would be beneficial before applying Zstd.
> > > > The short answer is "only for some cases but it will make it worse in
> > > both compression ratio and speed for others".
> > > >
> > > > Our initial investigation also separated the sign, exponent and
> mantissa
> > > into separate streams.
> > > >
> > > > The encoding was the following assuming 32-bit IEEE754:
> > > >
> > > > - stream of sign bits
> > > >
> > > > - stream of exponents bits. Conveniently the exponent for a 32-bit
> > > IEEE754 number is 8 bits.
> > > >
> > > > - separate the remaining 23 bits into three streams of 8, 8, and 7
> > > bits. An extra zero bit is added to the stream which has only seven
> > > bits. This was done since zstd, zlib, etc. work at a byte granularity
> > > and we would want repetitions to happen at byte boundaries.
> > > >
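[Editor's note: extracting the three fp32 components described above can be sketched as follows. A hedged illustration in Python; the actual patches do this in C++/Java.]

```python
import struct

def split_components(value):
    # Reinterpret an fp32 value as its raw 32 bits and extract the
    # 1 sign bit, 8 exponent bits, and 23 mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa
```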
> > > > For 64-bit IEEE754 even more padding has to be added since the
> exponent
> > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > exponent bits and 4 more mantissa bits to keep repetitions at a byte
> > > granularity. My original report shows results for when the
> floating-point
> > > values are split at a component granularity. Report is here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > Results are just slightly better in terms of compression ratio for
> some
> > > tests but compression and decompression speed is expectedly worse. The
> > > reason is that splitting a value is somewhat more complex. We need to
> keep
> > > a stream of bits for the signs, keep track of when a byte in the
> stream is
> > > exhausted, do bit manipulation to extract components, etc. This is
> also the
> > > reason why I preferred to go with the byte-wise decomposition of the
> > > values. It's faster and the compression ratio is just slightly worse
> for
> > > some of the tests.
> > > >
> > > >
> > > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > > help when data is repetitive and make it worse for others. I suppose
> using
> > > one of the compressors yields a better compression ratio on average.
> Also,
> > > this can again make encoding and decoding slower.
> > > >
> > > >
> > > > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> > > >
> > > > - It would only make data more compressible and leave compression to
> the
> > > codec in use.
> > > >   This leaves the complexity to the codec and choice of
> > > speed/compression ratio to the user.
> > > >
> > > > - It should be fast.
> > > >   There's an extra compression step so preferably there's very little
> > > latency before it.
> > > >
> > > > @Wes, can you have a look?
> > > >
> > > > More opinions are welcome.
> > > >
> > > > If you have floating point data available, I would be very happy to
> > > examine whether this approach offers benefit for you.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for taking a look at this! I agree that the approach here
> looks
> > > > promising. We've had occasional requests for lossy floating point
> > > > compression in the past, so it would be good to add this.
> > > >
> > > > I did some work in this area a few years ago that is similar and I'd
> like
> > > > to hear what you think about that approach compared to this one. That
> > > work
> > > > was based on the same observation, that the main problem is the
> mantissa,
> > > > while exponents tend to compress well. What I did was take the
> exponent
> > > and
> > > > mantissa and encode each separately, like the component encoding in
> your
> > > > test. But to encode each stream, I used Parquet's RLE encoder
> instead of
> > > > just applying compression. This seemed to work well for exponents and
> > > sign
> > > > bits, but probably isn't worth the cost for mantissa bits. It could
> also
> > > be
> > > > interesting to test a separate stream for sign bits.
> > > >
> > > > I guess what I'd like to hear your take on is whether you think
> adding
> > > > run-length encoding to any of the byte streams would be beneficial
> before
> > > > applying Zstd.
> > > >
> > > > Thanks!
> > > >
> > > > rb
> > > >
> > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmck...@gmail.com
> > > <mailto:wesmck...@gmail.com>> wrote:
> > > >
> > > > > I'm interested in this. I have been busy the last couple of weeks
> so
> > > have
> > > > > not been able to take a closer look. I will try to give some
> feedback
> > > this
> > > > > week.
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.ra...@tum.de
> > > <mailto:martin.ra...@tum.de>> wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > >
> > > > > > thank you Julien for the interest.
> > > > > >
> > > > > >
> > > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > >
> > > > > > Do you have your own data which you would like to use for testing
> > > the new
> > > > > > encoding?
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Julien Le Dem <julien.le...@wework.com.INVALID>
> > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > To: dev@parquet.apache.org<mailto:dev@parquet.apache.org>
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > I think this looks promising to me. At first glance it seems to
> > > > > > combine simplicity and efficiency.
> > > > > > I'd like to hear more from other members of the PMC.
> > > > > >
> > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> martin.ra...@tum.de
> > > <mailto:martin.ra...@tum.de>>
> > > > > wrote:
> > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > >
> > > > > > > there was some earlier discussion on adding a new encoding for
> > > better
> > > > > > > compression of FP32 and FP64 data.
> > > > > > >
> > > > > > >
> > > > > > > The pull request which extends the format is here:
> > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > The change has one approval from earlier from Zoltan.
> > > > > > >
> > > > > > >
> > > > > > > The results from an investigation on compression ratio and speed
> > > > > > > with the new encoding vs other encodings are available here:
> > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > For many tests the new encoding achieves a better compression
> > > > > > > ratio and, in some cases, better speed. The improvements in
> > > > > > > compression speed come from the fact that the new format can
> > > > > > > potentially lead to faster parsing for some compressors like GZIP.
> > > > > > >
> > > > > > >
> > > > > > > An earlier report which examines other FP compressors (fpzip,
> spdp,
> > > > > fpc,
> > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > >
> > > > > >
> > > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
> > > > > > > encoding focuses only on lossless compression.
> > > > > > >
> > > > > > >
> > > > > > > Can we have a vote?
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Martin
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
