+1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.

On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> +1 from me on adding the FP encoding
>
> On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin <martin.ra...@tum.de> wrote:
> >
> > Hello all,
> >
> > thanks for the vote, Ryan, and to Wes for the feedback.
> >
> > The concern with regard to adding more complex features to the Parquet
> > spec is valid. However, the proposed encoding is very simple, and I
> > already have unpolished patches for both parquet-mr and arrow. In its
> > design I purposely opted for something simple to guarantee 1) good
> > compression speed and 2) ease of implementation.
> >
> > On the topic of testing, I added four more test cases, taken from
> > here<https://sdrbench.github.io/>. I also added the size in MB of all
> > test cases and the entropy per element. Having the entropy reported
> > helps show that the encoding performs better than any other option for
> > high-entropy data and not so well for low-entropy data.
> >
> > I would be happy to receive some more feedback and votes.
> >
> > Kind regards,
> > Martin
> >
> > ________________________________
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > Sent: Friday, November 1, 2019 6:28 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format. It
> > looks like it is simple and performs well, and we could use a good
> > floating-point encoding.
> >
> > I don't think I agree that differences in features between the Java and
> > CPP implementations should hold back new work. It would be great to have
> > more testing and validation, as well as more thorough implementations,
> > but I don't think we should hold back contributions like this because of
> > those concerns.
> >
> > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I have to say I'm struggling with piling more things into the Parquet
> > > specification when we already have a significant implementation
> > > shortfall in other areas. LZ4 is still not properly implemented, for
> > > example, and then there is the question of the V2 encodings and data
> > > page formats.
> > >
> > > I'm generally in favor of adding more efficient storage of
> > > floating-point data, but will it actually be implemented broadly?
> > > Parquet as a format has already become an "implementation swamp" where
> > > any two implementations may not be compatible with each other,
> > > particularly with respect to the more advanced features.
> > >
> > > For a single organization using a single implementation, having
> > > advanced features may be useful, so I see the benefits to users that
> > > tightly control what code and what settings to use.
> > >
> > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin <martin.ra...@tum.de> wrote:
> > > >
> > > > Dear all,
> > > >
> > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > > > encoding? Please feel free to contact me directly if you need help or
> > > > would like to provide more test data.
> > > >
> > > > Results for the encoding, based on the implementation in Arrow, are here:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > The patch to Arrow is here:
> > > > https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > >
> > > > The new encoding combined with a compressor performs better than any
> > > > of the other alternatives for data where there is little change in
> > > > the upper-most bytes of fp32 and fp64 values. My early experiments
> > > > also show that this encoding+zstd performs better on average than any
> > > > of the specialized lossless floating-point compressors like fpc,
> > > > spdp, zfp.
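[Editor's note] For readers skimming the thread, the transform under vote is simple enough to model in a few lines. A minimal Python sketch of BYTE_STREAM_SPLIT for fp32 (the function names are mine, not part of the proposal; the real patches operate on raw column buffers in C++/Java):

```python
import struct

def byte_stream_split_encode(values, width=4):
    """Scatter byte k of every fixed-width value into stream k.
    Slowly varying high-order bytes end up contiguous, which is what
    makes the output friendlier to a general-purpose compressor."""
    raw = b"".join(struct.pack("<f", v) for v in values)  # fp32, little-endian
    return [raw[k::width] for k in range(width)]

def byte_stream_split_decode(streams, width=4):
    """Inverse transform: interleave the streams back into values."""
    n = len(streams[0])
    raw = bytearray(n * width)
    for k, stream in enumerate(streams):
        raw[k::width] = stream
    return [struct.unpack_from("<f", raw, i * width)[0] for i in range(n)]
```

The transform is lossless and byte-oriented; the codec (ZSTD, GZIP, ...) is then applied to the concatenated streams, so all compression complexity stays in the codec.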
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Radev, Martin <martin.ra...@tum.de>
> > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Dear Ryan Blue and other Parquet developers,
> > > >
> > > > I tested Ryan's proposal for modifying the encoding. The short answer
> > > > is that it doesn't perform well in my tests. The encoding, code and
> > > > results can be viewed below.
> > > >
> > > > The current implementation only handles 32-bit IEEE754 floats, in the
> > > > following way:
> > > >
> > > >   1. For each block of 128 values, the min and max of the exponent
> > > >      are computed.
> > > >   2. The number of bits for the exponent RLE is computed as
> > > >      ceil(log2(max - min + 1)). The sign uses 1 bit.
> > > >   3. The sign, exponent and 23 remaining mantissa bits are extracted.
> > > >   4. One RLE encoder is used for the sign and one for the exponent. A
> > > >      new RLE encoder for the exponent is created if a block requires
> > > >      fewer or more bits than the current encoder uses. The 23 mantissa
> > > >      bits are divided into three streams. (Not sure whether this is
> > > >      strictly a good idea.)
> > > >   5. Also, for each 128 values we need to store 2 extra bytes: the min
> > > >      exponent and the number of bits used by the RLE.
> > > >
> > > > I did not implement a decoder and have not added unit tests to
> > > > guarantee that the implementation is sound.
> > > >
> > > > Ryan, can you please review the implementation as to whether it
> > > > corresponds to what you had in mind? Check
> > > > FPByteStreamSplitComponentRLEEncoder. Please do not focus on the code
> > > > quality - this is only a quick hack.
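[Editor's note] Steps 1, 2 and 5 above can be modeled in a few lines. A hedged sketch (identifiers are mine; the actual hack uses Arrow's RLEEncoder on raw buffers):

```python
import math
import struct

BLOCK = 128  # values per block, as in step 1

def fp32_fields(v):
    """Extract the (sign, exponent, mantissa) bit fields of an IEEE754 fp32."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def exponent_block_metadata(block):
    """Per-block metadata from steps 1-2 and 5: the minimum exponent and the
    RLE bit width ceil(log2(max - min + 1)). A width of 0 means every
    exponent in the block is equal. Both values fit in the 2 extra bytes
    stored per block."""
    exponents = [fp32_fields(v)[1] for v in block]
    lo, hi = min(exponents), max(exponents)
    return lo, math.ceil(math.log2(hi - lo + 1))
```

For a block of values whose exponents span a narrow range - the case Ryan expected for similar or sorted floats - the width is small and the RLE stream is compact; a wide exponent range forces a new encoder, which is the per-block overhead discussed later in the thread.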
> > > > https://github.com/martinradev/arrow/commit/7c1cc8b2a5ff9e6288c615c2acc44b1dce1bd773#diff-47fe879cb9baad6c633c55f0a34a09c3R979
> > > >
> > > > In my benchmark tests I compared the following scenarios:
> > > >
> > > > * dictionary
> > > > * plain encoding + ZSTD
> > > > * the StreamSplit+RLE implementation above
> > > > * the StreamSplit+RLE implementation above + ZSTD
> > > > * the original BYTE_STREAM_SPLIT proposal + ZSTD
> > > >
> > > > Size of parquet files, measured in MB, for each combination:
> > > >
> > > >                   Plain  Plain+ZSTD  Dict  Split+RLE  Split+RLE+ZSTD  StreamSplit+ZSTD
> > > > msg_bt.sp           128         112   128        113              93                84
> > > > msg_lu.sp            93          87    94         80              73                67
> > > > msg_sp.sp           139         125   139        123              96                88
> > > > msg_sweep3d.sp       60          19    60         47              40                13
> > > > num_brain.sp         68          60    69         55              54                49
> > > > num_comet.sp         52          45    50         46              42                38
> > > > num_control.sp       77          71    77         70              69                63
> > > > num_plasma.sp        17           2     8         16               8                 1
> > > > obs_error.sp         30          24    30         25              22                21
> > > > obs_info.sp          10           8    10          8               7                 7
> > > > obs_spitzer.sp       95          83    95         99              82                72
> > > > obs_temp.sp          20          18    20         18              18                16
> > > >
> > > > I have not tested the encoding on any of the other test data we have,
> > > > since they also contain FP64 columns. I did not add support for FP64
> > > > in the StreamSplit+RLE encoding to save time, and because I do not
> > > > expect much improvement.
> > > >
> > > > From the results it can be seen that the proposed StreamSplit+RLE
> > > > encoding does not improve results.
> > > >
> > > > Using RLE for the sign bits and exponent obfuscates the input, and the
> > > > compression step cannot fully take advantage of repetitions in the
> > > > data since they were removed by the RLE step. Repetitive patterns are
> > > > replaced by the RLE bits, which likely do not compress very well.
> > > >
> > > > Also, GZIP/ZSTD handle repetitions in data better on average.
> > > > For example, GZIP's Deflate algorithm can encode a long run length
> > > > with 3 bytes (not sure whether it can be less?) for the raw data plus
> > > > 3 bytes for the back-reference (8 bits + 15 bits + 2 bits).
> > > >
> > > > Now, the RLE step might produce better results for long runs of the
> > > > same value. However, compressors also handle more complex cases where
> > > > the data has a pattern which doesn't necessarily have a long run
> > > > length. Also, compressors like GZIP/ZSTD often do entropy-coding-aware
> > > > parsing (http://cbloomrants.blogspot.com/2008/10/10-10-08-7_10.html ,
> > > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html).
> > > >
> > > > The proposed RLE encoding+ZSTD is slower than BYTE_STREAM_SPLIT+ZSTD.
> > > > The reason is that BYTE_STREAM_SPLIT only does a scattered memcpy,
> > > > whereas the new RLE encoding has to extract the components and run
> > > > through each block of 128 values twice - once to compute min and max,
> > > > and once to encode. There's also the overhead of using the Arrow
> > > > RLEEncoder.
> > > >
> > > > Ryan and other folks, can you provide feedback? Does the
> > > > implementation look reasonable to you?
> > > >
> > > > Can somebody please work with me on this new encoding? There has been
> > > > some interest and some discussion, but it hasn't been pushed forward,
> > > > likely due to work around the current release.
> > > > For a bit more discussion and results, please refer to:
> > > > Recent benchmark of the Arrow implementation:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > Separate report:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com>
> > > > Sent: Thursday, September 19, 2019 7:54 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Sounds good, thanks for working on this!
> > > >
> > > > On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <martin.ra...@tum.de> wrote:
> > > >
> > > > Hello Ryan,
> > > >
> > > > we decided that it would be beneficial to try out your proposal.
> > > >
> > > > I will look into it and provide measurements on the compression ratio
> > > > and speed.
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com>
> > > > Sent: Saturday, September 14, 2019 2:23:20 AM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > > Using RLE for the sign, exponents and the top-most mantissa bytes
> > > > > can help when data is repetitive and make it worse for others.
> > > >
> > > > I agree. But we use RLE in similar cases because we do tend to have
> > > > runs of values, and values that fit in a fixed number of bits.
> > > > Exponents and sign bits would probably fit this model extremely well
> > > > most of the time if you have similar floating-point values or sorted
> > > > values. It would be really interesting to see how well this performs
> > > > in comparison to the compression tests you've already done. For
> > > > mantissa bits, I agree it wouldn't be worth encoding first.
> > > >
> > > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.ra...@tum.de> wrote:
> > > >
> > > > Hello Ryan, Wes and other parquet devs,
> > > >
> > > > thanks for the response. I was away on vacation and that's why I am
> > > > answering just now.
> > > >
> > > > > whether you think adding run-length encoding to any of the byte
> > > > > streams would be beneficial before applying Zstd.
> > > > The short answer is "only for some cases, but it will make things
> > > > worse in both compression ratio and speed for others".
> > > >
> > > > Our initial investigation also separated the sign, exponent and
> > > > mantissa into separate streams. The encoding was the following,
> > > > assuming 32-bit IEEE754:
> > > >
> > > > - a stream of sign bits
> > > > - a stream of exponent bits. Conveniently, the exponent of a 32-bit
> > > >   IEEE754 number is 8 bits.
> > > > - the remaining 23 bits separated into three streams of 8, 8 and 7
> > > >   bits. An extra zero bit is added to the stream which has only seven
> > > >   bits. This was done since zstd, zlib, etc. work at byte granularity
> > > >   and we want repetitions to occur at byte boundaries.
> > > >
> > > > For 64-bit IEEE754 even more padding has to be added, since the
> > > > exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add
> > > > 5 more exponent bits and 4 more mantissa bits to keep repetitions at
> > > > byte granularity.
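[Editor's note] The component layout described above can be sketched as follows (a model only; identifiers are mine, and the sign bits are kept as a plain list rather than a packed bitstream for clarity):

```python
import struct

def split_fp32_components(values):
    """Bit-level component split for fp32: a sign-bit stream, an 8-bit
    exponent stream, and the 23 mantissa bits cut into 8/8/7-bit streams.
    The 7-bit stream keeps its top bit zero - the padding described above -
    so every stream stays at byte granularity for zstd/zlib."""
    signs = []
    exp, m_hi, m_mid, m_lo = bytearray(), bytearray(), bytearray(), bytearray()
    for v in values:
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        signs.append(bits >> 31)
        exp.append((bits >> 23) & 0xFF)
        mantissa = bits & 0x7FFFFF
        m_hi.append((mantissa >> 15) & 0xFF)   # mantissa bits 22..15
        m_mid.append((mantissa >> 7) & 0xFF)   # mantissa bits 14..7
        m_lo.append(mantissa & 0x7F)           # mantissa bits 6..0, zero-padded
    return signs, bytes(exp), bytes(m_hi), bytes(m_mid), bytes(m_lo)
```

The per-value bit manipulation here is exactly the extra work, relative to a scattered byte copy, that makes the component split slower, as the next message explains.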
> > > > My original report shows results for when the floating-point values
> > > > are split at component granularity. The report is here:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > > The results are just slightly better in terms of compression ratio for
> > > > some tests, but compression and decompression speed is expectedly
> > > > worse. The reason is that splitting a value is somewhat more complex:
> > > > we need to keep a stream of bits for the signs, keep track of when a
> > > > byte in the stream is exhausted, do bit manipulation to extract
> > > > components, etc. This is also the reason why I preferred to go with
> > > > the byte-wise decomposition of the values. It's faster, and the
> > > > compression ratio is only slightly worse for some of the tests.
> > > >
> > > > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > > > help when data is repetitive and make it worse otherwise. I suppose
> > > > using one of the compressors yields a better compression ratio on
> > > > average. Also, this can again make encoding and decoding slower.
> > > >
> > > > The design of the BYTE_STREAM_SPLIT encoding had two things in mind:
> > > >
> > > > - It should only make data more compressible and leave compression to
> > > >   the codec in use. This leaves the complexity to the codec and the
> > > >   choice of speed/compression ratio to the user.
> > > > - It should be fast. There's an extra compression step, so preferably
> > > >   there's very little latency before it.
> > > >
> > > > @Wes, can you have a look?
> > > >
> > > > More opinions are welcome. If you have floating-point data available,
> > > > I would be very happy to examine whether this approach offers a
> > > > benefit for you.
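[Editor's note] The two design goals can be demonstrated end to end. A sketch comparing the plain and split layouts under a general-purpose codec (zlib from the Python standard library stands in for ZSTD/GZIP; the data and sizes are illustrative, not from the benchmarks in this thread):

```python
import struct
import zlib

def layout_sizes(values):
    """Compress the same fp32 column in both layouts and return
    (plain_size, split_size). The transform adds no information; it only
    rearranges bytes so the codec sees long byte-level runs."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    split = b"".join(raw[k::4] for k in range(4))  # scattered-copy transform
    return len(zlib.compress(raw, 9)), len(zlib.compress(split, 9))

# Smoothly varying data: the top bytes barely change across values, so the
# split layout exposes near-constant streams to the codec.
column = [1.0 + i * 1e-4 for i in range(4096)]
plain_size, split_size = layout_sizes(column)
```

On data like this the split layout should compress noticeably better; on high-entropy mantissas the gap shrinks, matching the entropy discussion earlier in the thread.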
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > Sent: Tuesday, September 3, 2019 11:51:46 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for taking a look at this! I agree that the approach here looks
> > > > promising. We've had occasional requests for lossy floating point
> > > > compression in the past, so it would be good to add this.
> > > >
> > > > I did some work in this area a few years ago that is similar, and I'd
> > > > like to hear what you think about that approach compared to this one.
> > > > That work was based on the same observation: that the main problem is
> > > > the mantissa, while exponents tend to compress well. What I did was
> > > > take the exponent and mantissa and encode each separately, like the
> > > > component encoding in your test. But to encode each stream, I used
> > > > Parquet's RLE encoder instead of just applying compression. This
> > > > seemed to work well for exponents and sign bits, but probably isn't
> > > > worth the cost for mantissa bits. It could also be interesting to test
> > > > a separate stream for sign bits.
> > > >
> > > > I guess what I'd like to hear your take on is whether you think adding
> > > > run-length encoding to any of the byte streams would be beneficial
> > > > before applying Zstd.
> > > >
> > > > Thanks!
> > > >
> > > > rb
> > > >
> > > > On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > > I'm interested in this. I have been busy the last couple of weeks,
> > > > > so I have not been able to take a closer look. I will try to give
> > > > > some feedback this week.
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <martin.ra...@tum.de> wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > thank you, Julien, for the interest.
> > > > > >
> > > > > > Could other people, part of Apache Parquet, share their opinions?
> > > > > > Do you have your own data which you would like to use for testing
> > > > > > the new encoding?
> > > > > >
> > > > > > Regards,
> > > > > > Martin
> > > > > >
> > > > > > ________________________________
> > > > > > From: Julien Le Dem <julien.le...@wework.com.INVALID>
> > > > > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > > > > To: dev@parquet.apache.org
> > > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > > >
> > > > > > This looks promising to me. At first glance it seems to combine
> > > > > > simplicity and efficiency. I'd like to hear more from other members
> > > > > > of the PMC.
> > > > > >
> > > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <martin.ra...@tum.de> wrote:
> > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > > there was some earlier discussion on adding a new encoding for
> > > > > > > better compression of FP32 and FP64 data.
> > > > > > >
> > > > > > > The pull request which extends the format is here:
> > > > > > > https://github.com/apache/parquet-format/pull/144
> > > > > > > The change already has one approval, from Zoltan.
> > > > > > >
> > > > > > > The results from an investigation of compression ratio and speed
> > > > > > > with the new encoding vs other encodings are available here:
> > > > > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > > > It is visible that for many tests the new encoding performs
> > > > > > > better in compression ratio, and in some cases in speed. The
> > > > > > > improvements in compression speed come from the fact that the
> > > > > > > new format can potentially lead to faster parsing for some
> > > > > > > compressors like GZIP.
> > > > > > >
> > > > > > > An earlier report which examines other FP compressors (fpzip,
> > > > > > > spdp, fpc, zfp, sz) and new potential encodings is available here:
> > > > > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > > > > The report also covers lossy compression, but the
> > > > > > > BYTE_STREAM_SPLIT encoding focuses only on lossless compression.
> > > > > > >
> > > > > > > Can we have a vote?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Martin
>
> --
> Ryan Blue
> Software Engineer
> Netflix