Hi Martin,

The vote succeeded with 4 +1 votes (3 binding) and no -1 votes.
I'll submit your PR soon.

Cheers,
Gabor

On Fri, Nov 29, 2019 at 10:59 PM Radev, Martin <[email protected]> wrote:

> Dear Parquet community,
>
> thank you for the votes and time spent reviewing.
>
> There have been 4 votes so far on the mailing list, and Zoltan added a +1
> to the GitHub pull request some time ago.
> How many more would be necessary to have the spec change accepted?
> Once that is done I can continue with polishing my changes for Arrow and
> Parquet-MR.
>
> There are some new things.
> I did some further tests and implementation changes (no spec changes):
>
>   1.  I added more real data tests from various sources.
>   2.  I generated synthetic data with varying zero-order entropy of an
> element for each test file.
> This should help give some idea of when different encoding-compressor options
> perform well.
>   3.  I investigated different possible implementations for the
> encoder/decoder and added the most efficient option locally into Arrow's
> parquet-cpp.
> This was necessary since I found out that my early scalar implementation
> in Arrow did not get auto-vectorized by the compiler and thus was somewhat
> inefficient, especially when the data is already in the cache.
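To make the transform concrete for readers of the thread: the encoding itself is only a byte-level transpose, sketched below in Python (illustrative only; the helper names are mine and this is not the vectorized Arrow parquet-cpp code, just the idea it implements):

```python
import struct

def byte_stream_split_encode(values):
    """Scalar BYTE_STREAM_SPLIT for fp32: byte k of every value goes into
    stream k, and the four streams are concatenated."""
    raw = struct.pack('<%df' % len(values), *values)
    streams = [bytearray() for _ in range(4)]
    for i, b in enumerate(raw):
        streams[i % 4].append(b)
    return b''.join(bytes(s) for s in streams)

def byte_stream_split_decode(data, count):
    """Inverse transform: interleave the four byte streams back into
    little-endian fp32 values."""
    raw = bytearray(4 * count)
    for k in range(4):
        for i in range(count):
            raw[4 * i + k] = data[k * count + i]
    return list(struct.unpack('<%df' % count, bytes(raw)))

# Round-trip check on a few exactly representable values.
values = [1.0, 1.5, -2.25, 3.875]
assert byte_stream_split_decode(byte_stream_split_encode(values), 4) == values
```

A vectorized implementation does the same transpose with SIMD shuffles instead of a scalar loop, which is where the speedup over compiler output comes from.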
>
> The report is available here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf.
> It mostly contains tables and plots with some context information and
> reasoning.
> The two SIMD implementations I wrote are available here in a standalone
> program for benchmarking and validating:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/prog.cpp
>
> The short story is that the BYTE_STREAM_SPLIT encoding also performs best
> among all the options for the newly added test cases.
> Furthermore, the plot for data with varying entropy shows that the
> BYTE_STREAM_SPLIT encoding with zstd provides the best compression ratio
> when the entropy per float element is >8 bits. The BYTE_STREAM_SPLIT
> encoding+zstd provides better write throughput than all other combinations.
> It also improves read throughput compared to just using ZSTD to compress
> the data. The report also shows the performance difference between the
> encoder/decoder implementations and compares them to memcpy. There's a
> clear performance advantage in using the custom vectorized version over
> what GCC generates.
>
> If you have any questions, let me know.
>
> Kind regards,
> Martin
>
>
> ________________________________
> From: Junjie Chen <[email protected]>
> Sent: Friday, November 8, 2019 2:56:48 AM
> To: [email protected]
> Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> +1 from me to add BYTE_STREAM_SPLIT to parquet-format.
>
> Gabor Szadovszky <[email protected]> 于2019年11月7日周四 下午6:07写道:
> >
> > +1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.
> >
> > On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <[email protected]>
> wrote:
> >
> > > +1 from me on adding the FP encoding
> > >
> > > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <[email protected]>
> wrote:
> > > >
> > > > Hello all,
> > > >
> > > >
> > > > thanks for the vote Ryan and to Wes for the feedback.
> > > >
> > > >
> > > > The concern with regards to adding more complex features in the
> Parquet
> > > spec is valid.
> > > >
> > > > However, the proposed encoding is very simple and I already have
> > > unpolished patches for both parquet-mr and arrow.
> > > >
> > > > In its design I purposely opted for something simple to guarantee 1)
> > > good compression speed and 2) ease of implementation.
> > > >
> > > >
> > > > On the topic of testing, I added four more test cases which were
> taken
> > > from here<https://sdrbench.github.io/>. I also added the size in MB of
> > > all test cases and the entropy per element.
> > > >
> > > > Having the entropy reported helps show that the encoding performs
> better
> > > than any other option for high-entropy data and not so well for
> low-entropy
> > > data.
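For reference, the zero-order entropy reported here can be computed directly from the value histogram; a small sketch (illustrative only, not necessarily the exact code used in the benchmarks):

```python
import math
from collections import Counter

def zero_order_entropy(values):
    """Zero-order (Shannon) entropy in bits per element: each distinct
    value is a symbol, ordering is ignored."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Four equiprobable values carry exactly 2 bits per element.
assert abs(zero_order_entropy([0.5, 1.5, 2.5, 3.5]) - 2.0) < 1e-9
```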
> > > >
> > > >
> > > > I would be happy to receive some more feedback and votes.
> > > >
> > > >
> > > > Kind regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <[email protected]>
> > > > Sent: Friday, November 1, 2019 6:28 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> > > Looks
> > > > like it is simple and performs well. We could use a good floating
> point
> > > > encoding.
> > > >
> > > > I don't think I agree that differences in features between the Java
> and
> > > CPP
> > > > implementations should hold back new work. It would be great to have
> more
> > > > testing and validation, as well as more thorough implementations.
> But I
> > > > don't think we shouldn't accept contributions like this because of
> those
> > > > concerns.
> > > >
> > > > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <[email protected]>
> wrote:
> > > >
> > > > > I have to say I'm struggling with piling more things into the
> Parquet
> > > > > specification when we already have a significant implementation
> > > > > shortfall in other areas. LZ4 is still not properly implemented for
> > > > > example, and then there is the question of the V2 encodings and
> data
> > > > > page formats.
> > > > >
> > > > > I'm generally in favor of adding more efficient storage of floating
> > > > > point data, but will it actually be implemented broadly? Parquet
> as a
> > > > > format already has become an "implementation swamp" where any two
> > > > > implementations may not be compatible with each other,
> particularly in
> > > > > consideration of more advanced features.
> > > > >
> > > > > For a single organization using a single implementation, having
> > > > > advanced features may be useful, so I see the benefits to users
> that
> > > > > tightly control what code and what settings to use.
> > > > >
> > > > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <[email protected]
> >
> > > wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > >
> > > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > > encoding?
> > > > > >
> > > > > > Please feel free to contact me directly if you need help or would
> > > like
> > > > > to provide more test data.
> > > > > >
> > > > > >
> > > > > > Results for the encoding based on the implementation in Arrow are
> > > here:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > Patch to Arrow is here:
> > > > >
> > >
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > > > >
> > > > > >
> > > > > > The new encoding combined with a compressor performs better than
> any
> > > of
> > > > > the other alternatives for data where there is little change in the
> > > > > upper-most bytes of fp32 and fp64 values. My early experiments also
> > > show
> > > > > that this encoding+zstd performs better on average than any of the
> > > > > specialized floating-point lossless compressors like fpc, spdp,
> zfp.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Radev, Martin <[email protected]>
> > > > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > > > To: Parquet Dev
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > Dear Ryan Blue and other Parquet developers,
> > > > > >
> > > > > > I tested Ryan's proposal for modifying the encoding.
> > > > > >
> > > > > > The short answer is that it doesn't perform well in my tests. The
> > > > > encoding, code and results can be viewed below.
> > > > > >
> > > > > >
> > > > > > The current implementation only handles 32-bit IEEE754 floats in
> the
> > > > > following way:
> > > > > >
> > > > > >   1.  For each block of 128 values, the min and max is computed
> for
> > > the
> > > > > exponent
> > > > > >   2.  The number of bits for the exponent RLE is computed as
> > > > > ceil(log2((max - min + 1))). The sign bit uses 1 bit.
> > > > > >   3.  The sign, exponent and 23 remaining mantissa bits are
> > > extracted.
> > > > > >   4.  One RLE encoder is used for the sign and one for the
> exponent.
> > > > > > A new RLE encoder for the exponent is created if the block
> requires
> > > fewer
> > > > > or more bits than the number of bits used for the current encoder.
> > > > > > The 23 mantissa bits are divided into three streams. (Not sure
> > > whether
> > > > > this is strictly a good idea).
> > > > > >   5.  Also, for each 128 values we need to store 2 bytes: the min
> > > value
> > > > > and the number of bits used by the RLE.
> > > > > >
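To make steps 1-3 above concrete, here is a small Python sketch of the fp32 component extraction and the per-block RLE bit-width computation (illustrative only; helper names are mine, and the real encoder is the Arrow C++ hack linked further down):

```python
import math
import struct

def fp32_components(x):
    """Step 3: split an IEEE754 fp32 into its (sign, exponent, mantissa)
    bit fields."""
    bits, = struct.unpack('<I', struct.pack('<f', x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def exponent_rle_bit_width(block):
    """Steps 1-2: min/max of the exponents in a block of values, and the
    RLE bit width ceil(log2(max - min + 1)).  A block with a constant
    exponent needs 0 bits."""
    exps = [fp32_components(v)[1] for v in block]
    lo, hi = min(exps), max(exps)
    return lo, math.ceil(math.log2(hi - lo + 1))

# A block whose exponents span only two adjacent values needs 1 RLE bit:
# exponent_rle_bit_width([1.0, 1.5, 2.0, 3.0]) -> (127, 1)
```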
> > > > > >
> > > > > > I did not implement a decoder and have not added unit tests to
> > > guarantee
> > > > > that the implementation is sound.
> > > > > >
> > > > > > Ryan, can you please review the implementation as to whether it
> > > > > corresponds to what you had in mind?
> > > > > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus
> on
> > > the
> > > > > code quality - this is only a quick hack.
> > > > > >
> > > > >
> > >
> https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > > > >
> > > > > > In my benchmark tests I compared the following scenarios:
> > > > > >
> > > > > >   *   dictionary
> > > > > >   *   plain encoding + ZSTD
> > > > > >   *   the StreamSplit+RLE implementation above
> > > > > >   *   the StreamSplit+RLE implementation above + ZSTD
> > > > > >   *   the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > > > >
> > > > > > Size of parquet files, measured in MB, for each combination:
> > > > > >
> > > > > >   File            Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > > > >   msg_bt.sp         128         112   128        113              93                84
> > > > > >   msg_lu.sp          93          87    94         80              73                67
> > > > > >   msg_sp.sp         139         125   139        123              96                88
> > > > > >   msg_sweep3d.sp     60          19    60         47              40                13
> > > > > >   num_brain.sp       68          60    69         55              54                49
> > > > > >   num_comet.sp       52          45    50         46              42                38
> > > > > >   num_control.sp     77          71    77         70              69                63
> > > > > >   num_plasma.sp      17           2     8         16               8                 1
> > > > > >   obs_error.sp       30          24    30         25              22                21
> > > > > >   obs_info.sp        10           8    10          8               7                 7
> > > > > >   obs_spitzer.sp     95          83    95         99              82                72
> > > > > >   obs_temp.sp        20          18    20         18              18                16
> > > > > >
> > > > > >
> > > > > > I have not tested the encoding on any of the other test data we
> have
> > > > > since they also contain FP64 columns. I did not add support for
> FP64
> > > in the
> > > > > StreamSplit+RLE encoding to save on time and also because I do not
> > > expect
> > > > > much improvement.
> > > > > >
> > > > > >
> > > > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > > > encoding does not improve compression.
> > > > > >
> > > > > > Using RLE for the sign bits and exponent obfuscates the input
> and the
> > > > > compression step cannot fully take advantage of repetitions in the
> data
> > > > > since they were removed by the RLE step. Repetitive patterns are
> > > replaced
> > > > > by the RLE bits which likely do not compress very well.
> > > > > >
> > > > > > Also, GZIP/ZSTD handle repetitions in data better on average. For
> > > > > example, GZIP's Deflate algorithm can encode a long run length
> with 3
> > > > > bytes (not sure whether it can be less?) for the raw data + 3 bytes
> for
> > > the
> > > > > reference ( 8 bits + 15 bits + 2 bits ).
> > > > > >
> > > > > > Now, the RLE step might produce better results for long runs of
> the
> > > same
> > > > > value. However, compressors also handle more complex cases when we
> > > have a
> > > > > pattern in the data which doesn't necessarily have a long run length.
> > > Also,
> > > > > compressors like GZIP/ZSTD often do entropy-coding-aware parsing (
> > > > > http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> > > > > ).
> > > > > >
> > > > > > The proposed RLE encoding+ZSTD is slower than
> BYTE_STREAM_SPLIT+ZSTD.
> > > > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy
> > > whereas
> > > > > the new RLE encoding has to extract the components and run through
> each
> > > > > block of 128 values twice - once to compute min and max, and once
> to
> > > > > encode. There's also the overhead of using the Arrow RLEEncoder.
> > > > > >
> > > > > > Ryan and other folks, can you provide feedback? Does the
> > > implementation
> > > > > look reasonable to you?
> > > > > >
> > > > > > Can somebody please work with me on this new encoding? There has
> been
> > > > > some interest and some discussion, but it hasn't moved forward,
> > > likely
> > > > > due to work on the current release.
> > > > > > For a bit more discussions and results, please refer to:
> > > > > > Recent benchmark of Arrow implementation:
> > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > Separate report:
> > > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > >
> > > > > > Regards,
> > > > > > Martin
> > > > > >
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ryan Blue <[email protected]>
> > > > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > > > To: Radev, Martin
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > Sounds good, thanks for working on this!
> > > > > >
> > > > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <
> [email protected]
> > > > > <mailto:[email protected]>> wrote:
> > > > > >
> > > > > > Hello Ryan,
> > > > > >
> > > > > >
> > > > > > we decided that it would be beneficial to try out your proposal.
> > > > > >
> > > > > >
> > > > > > I will look into it and provide measurements on the compression
> ratio
> > > > > and speed.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ryan Blue <[email protected]<mailto:[email protected]>>
> > > > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > > > To: Radev, Martin
> > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > > Using RLE for the sign, exponents and the top-most mantissa
> bytes
> > > can
> > > > > help when data is repetitive and make it worse for others.
> > > > > >
> > > > > > I agree. But we use RLE in similar cases because we do tend to
> have
> > > runs
> > > > > of values, and values that fit in a fixed number of bits.
> Exponents and
> > > > > sign bits would probably fit this model extremely well most of the
> > > time if
> > > > > you have similar floating point values or sorted values. It would
> be
> > > really
> > > > > interesting to see how well this performs in comparison to the
> > > compression
> > > > > tests you've already done. For mantissa bits, I agree it wouldn't
> be
> > > worth
> > > > > encoding first.
> > > > > >
> > > > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <
> [email protected]
> > > > > <mailto:[email protected]>> wrote:
> > > > > >
> > > > > > Hello Ryan, Wes and other parquet devs,
> > > > > >
> > > > > >
> > > > > > thanks for the response. I was away on vacation and that's why I
> am
> > > > > answering just now.
> > > > > >
> > > > > >
> > > > > > > whether you think adding run-length encoding to any of the byte
> > > > > streams would be beneficial before applying Zstd.
> > > > > > The short answer is "only for some cases but it will make it
> worse in
> > > > > both compression ratio and speed for others".
> > > > > >
> > > > > > Our initial investigation also separated the sign, exponent and
> > > mantissa
> > > > > into separate streams.
> > > > > >
> > > > > > The encoding was the following assuming 32-bit IEEE754:
> > > > > >
> > > > > > - stream of sign bits
> > > > > >
> > > > > > - stream of exponents bits. Conveniently the exponent for a
> 32-bit
> > > > > IEEE754 number is 8 bits.
> > > > > >
> > > > > > - separate the remaining 23 bits into three streams of 8, 8 and 7
> bits.
> > > An
> > > > > extra zero bit is added to the stream which has only seven bits.
> This
> > > was
> > > > > done since zstd, zlib, etc work at a byte granularity and we would
> want
> > > > > repetitions to happen at such.
> > > > > >
> > > > > > For 64-bit IEEE754 even more padding has to be added since the
> > > exponent
> > > > > is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more
> > > > > exponent bits and 4 more mantissa bits to keep repetitions at a
> byte
> > > > > granularity. My original report shows results for when the
> > > floating-point
> > > > > values are split at a component granularity. Report is here:
> > > > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > > > Results are just slightly better in terms of compression ratio
> for
> > > some
> > > > > tests but compression and decompression speed is expectedly worse.
> The
> > > > > reason is that splitting a value is somewhat more complex. We need
> to
> > > keep
> > > > > a stream of bits for the signs, keep track of when a byte in the
> > > stream is
> > > > > exhausted, do bit manipulation to extract components, etc. This is
> > > also the
> > > > > reason why I preferred to go with the byte-wise decomposition of
> the
> > > > > values. It's faster and the compression ratio is just slightly
> worse
> > > for
> > > > > some of the tests.
> > > > > >
> > > > > >
> > > > > > Using RLE for the sign, exponents and the top-most mantissa
> bytes can
> > > > > help when data is repetitive and make it worse for others. I suppose
> > > using
> > > > > one of the compressors yields a better compression ratio on
> average.
> > > Also,
> > > > > this can again make encoding and decoding slower.
> > > > > >
> > > > > >
> > > > > > The design of the BYTE_STREAM_SPLIT encoding had in mind two
> things:
> > > > > >
> > > > > > - It would only make data more compressible and leave
> compression to
> > > the
> > > > > codec in use.
> > > > > >   This leaves the complexity to the codec and choice of
> > > > > speed/compression ratio to the user.
> > > > > >
> > > > > > - It should be fast.
> > > > > >   There's an extra compression step so preferably there's very
> little
> > > > > latency before it.
> > > > > >
> > > > > > @Wes, can you have a look?
> > > > > >
> > > > > > More opinions are welcome.
> > > > > >
> > > > > > If you have floating point data available, I would be very happy
> to
> > > > > examine whether this approach offers benefit for you.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ryan Blue <[email protected]>
> > > > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > > > To: Parquet Dev
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> Parquet
> > > > > >
> > > > > > Hi Martin,
> > > > > >
> > > > > > Thanks for taking a look at this! I agree that the approach here
> > > looks
> > > > > > promising. We've had occasional requests for lossy floating point
> > > > > > compression in the past, so it would be good to add this.
> > > > > >
> > > > > > I did some work in this area a few years ago that is similar and
> I'd
> > > like
> > > > > > to hear what you think about that approach compared to this one.
> That
> > > > > work
> > > > > > was based on the same observation, that the main problem is the
> > > mantissa,
> > > > > > while exponents tend to compress well. What I did was take the
> > > exponent
> > > > > and
> > > > > > mantissa and encode each separately, like the component encoding
> in
> > > your
> > > > > > test. But to encode each stream, I used Parquet's RLE encoder
> > > instead of
> > > > > > just applying compression. This seemed to work well for
> exponents and
> > > > > sign
> > > > > > bits, but probably isn't worth the cost for mantissa bits. It
> could
> > > also
> > > > > be
> > > > > > interesting to test a separate stream for sign bits.
> > > > > >
> > > > > > I guess what I'd like to hear your take on is whether you think
> > > adding
> > > > > > run-length encoding to any of the byte streams would be
> beneficial
> > > before
> > > > > > applying Zstd.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > rb
> > > > > >
> > > > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <
> [email protected]
> > > > > <mailto:[email protected]>> wrote:
> > > > > >
> > > > > > > I'm interested in this. I have been busy the last couple of
> weeks
> > > so
> > > > > have
> > > > > > > not been able to take a closer look. I will try to give some
> > > feedback
> > > > > this
> > > > > > > week.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <
> [email protected]
> > > > > <mailto:[email protected]>> wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > >
> > > > > > > > thank you Julien for the interest.
> > > > > > > >
> > > > > > > >
> > > > > > > > Could other people, part of Apache Parquet, share their
> opinions?
> > > > > > > >
> > > > > > > > Do you have your own data which you would like to use for
> testing
> > > > > the new
> > > > > > > > encoding?
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Martin
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Julien Le Dem <[email protected]>
> > > > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > > > To: [email protected]<mailto:[email protected]>
> > > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache
> > > Parquet
> > > > > > > >
> > > > > > > > I think this looks promising to me. At first glance it seems
> > > > > combining
> > > > > > > > simplicity and efficiency.
> > > > > > > > I'd like to hear more from other members of the PMC.
> > > > > > > >
> > > > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <
> > > [email protected]
> > > > > <mailto:[email protected]>>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Dear all,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > there was some earlier discussion on adding a new encoding
> for
> > > > > better
> > > > > > > > > compression of FP32 and FP64 data.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The pull request which extends the format is here:
> > > > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > > > The change has one approval from earlier from Zoltan.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The results from an investigation on compression ratio and
> > > speed
> > > > > with
> > > > > > > the
> > > > > > > > > new encoding vs other encodings is available here:
> > > > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > > > It is visible that for many tests the new encoding performs
> > > better
> > > > > in
> > > > > > > > > compression ratio and in some cases in speed. The
> improvements
> > > in
> > > > > > > > > compression speed come from the fact that the new format
> can
> > > > > > > potentially
> > > > > > > > > lead to a faster parsing for some compressors like GZIP.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > An earlier report which examines other FP compressors
> (fpzip,
> > > spdp,
> > > > > > > fpc,
> > > > > > > > > zfp, sz) and new potential encodings is available here:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > > > The report also covers lossy compression but the
> > > BYTE_STREAM_SPLIT
> > > > > > > > > encoding only has the focus of lossless compression.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Can we have a vote?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Martin
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > >
>
