+1 from me on adding the FP encoding
On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <[email protected]> wrote:
>
> Hello all,
>
> thanks for the vote, Ryan, and to Wes for the feedback.
>
> The concern with regard to adding more complex features to the Parquet spec is valid.
> However, the proposed encoding is very simple and I already have unpolished patches for both parquet-mr and arrow.
> In its design I purposely opted for something simple to guarantee 1) good compression speed and 2) ease of implementation.
>
> On the topic of testing, I added four more test cases which were taken from here<https://sdrbench.github.io/>. I also added the size in MB of each test case and the entropy per element.
> Having the entropy reported helps show that the encoding performs better than any other option for high-entropy data and not so well for low-entropy data.
>
> I would be happy to receive some more feedback and votes.
>
> Kind regards,
> Martin
>
> ________________________________
> From: Ryan Blue <[email protected]>
> Sent: Friday, November 1, 2019 6:28 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. It looks like it is simple and performs well. We could use a good floating-point encoding.
>
> I don't think I agree that differences in features between the Java and CPP implementations should hold back new work. It would be great to have more testing and validation, as well as more thorough implementations. But I don't think we should reject contributions like this because of those concerns.
>
> On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <[email protected]> wrote:
>
> > I have to say I'm struggling with piling more things into the Parquet specification when we already have a significant implementation shortfall in other areas.
> > LZ4 is still not properly implemented, for example, and then there is the question of the V2 encodings and data page formats.
> >
> > I'm generally in favor of adding more efficient storage of floating point data, but will it actually be implemented broadly? Parquet as a format has already become an "implementation swamp" where any two implementations may not be compatible with each other, particularly in consideration of more advanced features.
> >
> > For a single organization using a single implementation, having advanced features may be useful, so I see the benefits to users that tightly control what code and what settings to use.
> >
> > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <[email protected]> wrote:
> > >
> > > Dear all,
> > >
> > > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding?
> > > Please feel free to contact me directly if you need help or would like to provide more test data.
> > >
> > > Results for the encoding based on the implementation in Arrow are here: https://github.com/martinradev/arrow-fp-compression-bench
> > > Patch to Arrow is here: https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > >
> > > The new encoding combined with a compressor performs better than any of the other alternatives for data where there is little change in the upper-most bytes of fp32 and fp64 values. My early experiments also show that this encoding+zstd performs better on average than any of the specialized floating-point lossless compressors like fpc, spdp, zfp.
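[For readers new to the thread, the encoding under vote is a simple byte transposition. A minimal Python sketch of the idea — illustrative only, not the Arrow or parquet-mr patch:]

```python
import struct

def byte_stream_split_encode(values):
    """BYTE_STREAM_SPLIT for fp32: scatter the 4 bytes of each value into
    4 separate streams (all 0th bytes first, then all 1st bytes, ...).
    Bytes of equal significance end up adjacent, which a general-purpose
    codec such as zstd or gzip can then compress more effectively."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[i::4] for i in range(4))

def byte_stream_split_decode(data):
    """Inverse transform: interleave the 4 streams back into fp32 values."""
    n = len(data) // 4
    raw = bytearray(len(data))
    for i in range(4):
        raw[i::4] = data[i * n:(i + 1) * n]
    return list(struct.unpack(f"<{n}f", bytes(raw)))
```

The encode path is essentially a scattered memcpy, which is where the speed advantage over bit-level component splitting comes from.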
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Radev, Martin <[email protected]>
> > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Dear Ryan Blue and other Parquet developers,
> > >
> > > I tested Ryan's proposal for modifying the encoding.
> > > The short answer is that it doesn't perform well in my tests. The encoding, code and results can be viewed below.
> > >
> > > The current implementation only handles 32-bit IEEE754 floats in the following way:
> > >
> > > 1. For each block of 128 values, the min and max of the exponent are computed.
> > > 2. The number of bits for the exponent RLE is computed as ceil(log2(max - min + 1)). The sign uses 1 bit.
> > > 3. The sign, exponent and 23 remaining mantissa bits are extracted.
> > > 4. One RLE encoder is used for the sign and one for the exponent. A new RLE encoder for the exponent is created if the block requires fewer or more bits than the number of bits used for the current encoder. The 23 mantissa bits are divided into three streams. (Not sure whether this is strictly a good idea.)
> > > 5. Also, for every 128 values we need to store 2 bytes: the min exponent and the number of bits used by the RLE.
> > >
> > > I did not implement a decoder and have not added unit tests to guarantee that the implementation is sound.
> > >
> > > Ryan, can you please review the implementation as to whether it corresponds to what you had in mind?
> > > Check FPByteStreamSplitComponentRLEEncoder. Please do not focus on the code quality - this is only a quick hack.
> > > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > >
> > > In my benchmark tests I compared the following scenarios:
> > >
> > > * dictionary
> > > * plain encoding + ZSTD
> > > * the StreamSplit+RLE implementation above
> > > * the StreamSplit+RLE implementation above + ZSTD
> > > * the original BYTE_STREAM_SPLIT proposal + ZSTD
> > >
> > > Size of parquet files, measured in MB, for each combination:
> > >
> > >                  Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > msg_bt.sp         128      112      128      113           93               84
> > > msg_lu.sp          93       87       94       80           73               67
> > > msg_sp.sp         139      125      139      123           96               88
> > > msg_sweep3d.sp     60       19       60       47           40               13
> > > num_brain.sp       68       60       69       55           54               49
> > > num_comet.sp       52       45       50       46           42               38
> > > num_control.sp     77       71       77       70           69               63
> > > num_plasma.sp      17        2        8       16            8                1
> > > obs_error.sp       30       24       30       25           22               21
> > > obs_info.sp        10        8       10        8            7                7
> > > obs_spitzer.sp     95       83       95       99           82               72
> > > obs_temp.sp        20       18       20       18           18               16
> > >
> > > I have not tested the encoding on any of the other test data we have since they also contain FP64 columns. I did not add support for FP64 in the StreamSplit+RLE encoding to save time and also because I do not expect much improvement.
> > >
> > > From the results it can be seen that the proposed StreamSplit+RLE encoding does not improve results.
> > >
> > > Using RLE for the sign bits and exponent obfuscates the input, and the compression step cannot fully take advantage of repetitions in the data since the RLE step removed them. Repetitive patterns are replaced by the RLE bits, which likely do not compress very well.
> > >
> > > Also, GZIP/ZSTD handle repetitions in data better on average. For example, GZIP's Deflate algorithm can encode a long run length with 3 bytes (not sure whether it can be less?)
> > > for the raw data + 3 bytes for the reference (8 bits + 15 bits + 2 bits).
> > >
> > > Now, the RLE step might produce better results for long runs of the same value. However, compressors also handle more complex cases where we have a pattern in the data which doesn't necessarily have a long run length. Also, compressors like GZIP/ZSTD often do entropy-coding-aware parsing (http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html, https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html).
> > >
> > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD. The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy whereas the new RLE encoding has to extract the components and run through each block of 128 values twice - once to compute min and max, and once to encode. There's also the overhead of using the Arrow RLEEncoder.
> > >
> > > Ryan and other folks, can you provide feedback? Does the implementation look reasonable to you?
> > >
> > > Can somebody please work with me on this new encoding? There has been some interest and some discussions, but it likely hasn't been pushed forward due to work around the current release.
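[The per-block exponent handling from steps 1-2 of the experiment above can be sketched as follows; the function name is illustrative and not from the actual patch:]

```python
import math
import struct

def exponent_block_params(block):
    """For one non-empty block of up to 128 fp32 values, extract the
    8-bit biased exponent of each value and size the RLE bit width as
    ceil(log2(max - min + 1)), as in step 2 of the experiment.
    Returns (min_exponent, bit_width); a width of 0 means every
    exponent in the block is identical."""
    exps = [(struct.unpack("<I", struct.pack("<f", v))[0] >> 23) & 0xFF
            for v in block]
    lo, hi = min(exps), max(exps)
    return lo, math.ceil(math.log2(hi - lo + 1))
```

Per block, `lo` is the 1-byte minimum stored alongside the width; each exponent is then RLE-encoded as `exp - lo` in `bit_width` bits, and a new RLE encoder is started whenever the width changes.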
> > > For more discussion and results, please refer to:
> > > Recent benchmark of Arrow implementation: https://github.com/martinradev/arrow-fp-compression-bench
> > > Separate report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <[email protected]>
> > > Sent: Thursday, September 19, 2019 7:54 PM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Sounds good, thanks for working on this!
> > >
> > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <[email protected]> wrote:
> > >
> > > Hello Ryan,
> > >
> > > we decided that it would be beneficial to try out your proposal.
> > > I will look into it and provide measurements on the compression ratio and speed.
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <[email protected]>
> > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > To: Radev, Martin
> > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > > Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for others.
> > >
> > > I agree. But we use RLE in similar cases because we do tend to have runs of values, and values that fit in a fixed number of bits.
> > > Exponents and sign bits would probably fit this model extremely well most of the time if you have similar floating point values or sorted values. It would be really interesting to see how well this performs in comparison to the compression tests you've already done. For mantissa bits, I agree it wouldn't be worth encoding first.
> > >
> > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <[email protected]> wrote:
> > >
> > > Hello Ryan, Wes and other parquet devs,
> > >
> > > thanks for the response. I was away on vacation and that's why I am answering just now.
> > >
> > > > whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
> > > The short answer is "only for some cases, but it will make it worse in both compression ratio and speed for others".
> > >
> > > Our initial investigation also separated the sign, exponent and mantissa into separate streams.
> > > The encoding was the following, assuming 32-bit IEEE754:
> > >
> > > - stream of sign bits
> > > - stream of exponent bits. Conveniently, the exponent for a 32-bit IEEE754 number is 8 bits.
> > > - the remaining 23 bits separated into three streams of 8, 8 and 7 bits. An extra zero bit is added to the stream which has only seven bits. This was done since zstd, zlib, etc. work at a byte granularity and we would want repetitions to happen at byte boundaries.
> > >
> > > For 64-bit IEEE754 even more padding has to be added since the exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits and 4 more mantissa bits to keep repetitions at a byte granularity. My original report shows results for when the floating-point values are split at a component granularity.
> > > Report is here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > Results are just slightly better in terms of compression ratio for some tests, but compression and decompression speed is expectedly worse. The reason is that splitting a value is somewhat more complex. We need to keep a stream of bits for the signs, keep track of when a byte in the stream is exhausted, do bit manipulation to extract components, etc. This is also the reason why I preferred to go with the byte-wise decomposition of the values. It's faster and the compression ratio is just slightly worse for some of the tests.
> > >
> > > Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for others. I suppose using one of the compressors yields a better compression ratio on average. Also, this can again make encoding and decoding slower.
> > >
> > > The design of the BYTE_STREAM_SPLIT encoding had in mind two things:
> > >
> > > - It would only make data more compressible and leave compression to the codec in use. This leaves the complexity to the codec and the choice of speed/compression ratio to the user.
> > > - It should be fast. There's an extra compression step, so preferably there's very little latency before it.
> > >
> > > @Wes, can you have a look?
> > > More opinions are welcome.
> > > If you have floating point data available, I would be very happy to examine whether this approach offers a benefit for you.
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Ryan Blue <[email protected]>
> > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > Hi Martin,
> > >
> > > Thanks for taking a look at this!
> > > I agree that the approach here looks promising. We've had occasional requests for lossy floating point compression in the past, so it would be good to add this.
> > >
> > > I did some work in this area a few years ago that is similar and I'd like to hear what you think about that approach compared to this one. That work was based on the same observation, that the main problem is the mantissa, while exponents tend to compress well. What I did was take the exponent and mantissa and encode each separately, like the component encoding in your test. But to encode each stream, I used Parquet's RLE encoder instead of just applying compression. This seemed to work well for exponents and sign bits, but probably isn't worth the cost for mantissa bits. It could also be interesting to test a separate stream for sign bits.
> > >
> > > I guess what I'd like to hear your take on is whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
> > >
> > > Thanks!
> > >
> > > rb
> > >
> > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <[email protected]> wrote:
> > >
> > > > I'm interested in this. I have been busy the last couple of weeks so have not been able to take a closer look. I will try to give some feedback this week.
> > > >
> > > > Thanks
> > > >
> > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <[email protected]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > thank you Julien for the interest.
> > > > >
> > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > Do you have your own data which you would like to use for testing the new encoding?
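[The component-wise decomposition discussed earlier in the thread — a bit-stream of signs, a byte stream of 8-bit exponents, and the 23 mantissa bits padded with a zero bit into three byte-aligned sub-streams of 8, 8 and 7(+1) bits — can be sketched as below. The function name is illustrative, not from the actual patches:]

```python
import struct

def split_components_fp32(values):
    """Component-wise split sketch: one packed bit-stream of signs, one
    byte stream of 8-bit biased exponents, and the 23 mantissa bits
    emitted as three byte streams (top 8, middle 8, low 7 bits plus one
    zero pad bit), keeping repetitions at byte granularity for a
    byte-oriented codec such as zstd or zlib."""
    signs, exps = bytearray(), bytearray()
    m_hi, m_mid, m_lo = bytearray(), bytearray(), bytearray()
    sign_acc, sign_n = 0, 0
    for v in values:
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        sign_acc = (sign_acc << 1) | (bits >> 31)  # pack 8 signs per byte
        sign_n += 1
        if sign_n == 8:
            signs.append(sign_acc)
            sign_acc, sign_n = 0, 0
        exps.append((bits >> 23) & 0xFF)   # 8-bit biased exponent
        m = bits & 0x7FFFFF                # 23 mantissa bits
        m_hi.append(m >> 15)               # top 8 bits
        m_mid.append((m >> 7) & 0xFF)      # middle 8 bits
        m_lo.append((m & 0x7F) << 1)       # low 7 bits + 1 zero pad bit
    if sign_n:                             # flush a partial sign byte
        signs.append(sign_acc << (8 - sign_n))
    return bytes(signs), bytes(exps), bytes(m_hi), bytes(m_mid), bytes(m_lo)
```

As the thread notes, this bit-level bookkeeping (sign packing, component extraction) is the main reason the component-wise variant encodes and decodes more slowly than the plain byte-wise split.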
> > > > >
> > > > > Regards,
> > > > > Martin
> > > > >
> > > > > ________________________________
> > > > > From: Julien Le Dem <[email protected]>
> > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > To: [email protected]
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > This looks promising to me. At first glance it seems to combine simplicity and efficiency.
> > > > > I'd like to hear more from other members of the PMC.
> > > > >
> > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <[email protected]> wrote:
> > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > there was some earlier discussion on adding a new encoding for better compression of FP32 and FP64 data.
> > > > > >
> > > > > > The pull request which extends the format is here: https://github.com/apache/parquet-format/pull/144
> > > > > > The change already has one approval, from Zoltan.
> > > > > >
> > > > > > The results from an investigation of compression ratio and speed with the new encoding vs other encodings are available here: https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > It is visible that for many tests the new encoding performs better in compression ratio and, in some cases, in speed. The improvements in compression speed come from the fact that the new format can potentially lead to faster parsing for some compressors like GZIP.
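[The effect described above — that grouping bytes of equal significance gives a byte-oriented compressor easier work — can be illustrated with Python's stdlib zlib (DEFLATE) standing in for the GZIP/ZSTD codecs measured in the thread. The data is synthetic, so the numbers only gesture at the effect:]

```python
import struct
import zlib

# Toy data: smoothly varying fp32 values, so the upper bytes
# (sign/exponent/high mantissa) barely change from value to value.
values = [1.0 + i * 1e-7 for i in range(10_000)]
plain = struct.pack(f"<{len(values)}f", *values)

# BYTE_STREAM_SPLIT layout: stream of 0th bytes, then 1st, 2nd, 3rd.
split = b"".join(plain[i::4] for i in range(4))

c_plain = len(zlib.compress(plain))
c_split = len(zlib.compress(split))
print(f"plain+deflate: {c_plain} bytes, split+deflate: {c_split} bytes")
```

On data like this, the split layout hands DEFLATE two nearly constant streams, so it compresses noticeably better; on low-entropy or already repetitive data the picture can flip, which matches the mixed per-dataset results in the thread's benchmarks.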
> > > > > >
> > > > > > An earlier report which examines other FP compressors (fpzip, spdp, fpc, zfp, sz) and new potential encodings is available here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT encoding focuses only on lossless compression.
> > > > > >
> > > > > > Can we have a vote?
> > > > > >
> > > > > > Regards,
> > > > > > Martin
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
